My Bike Rides and Weather Data¶
Table of Contents ¶
- Correlation Matrix and Pairplot
- Cycling Scatter Plots
- Cycling Treemap Charts
- Cycling Sunburst Charts
- Bike Comparison by Hours Spent and Distance Covered
- Bike Rides by Time Categories
- Temperature
- UV Index and Daylight Hours
- Humidity, Cloudiness and Precipitation
- Wind
- Sea Level Air Pressure
- Lunar Phases
Introduction ¶
I am a cycling enthusiast whose description matches this article almost word for word.
In 2015, I began recording my rides with various GPS applications (CycloMeter, GPS Tracker free, Endomondo, Bryton Active and finally Strava).
Until 2022 I did this only on my smartphone, and sometimes there were glitches: if the GPS signal was lost, the track could be recorded partially or not at all. In such cases I drew the GPX track myself on Google Maps and wrote down the distance (even with a lost GPS signal, you still get the elapsed time needed to calculate the average speed).
Therefore, from the very beginning I duplicated my records in an Excel spreadsheet. Now I ride with a cycling navigator that doesn't have these problems, but the habit of duplicating records in Excel remains.
As of today there are 449 entries in this table: 28,245.30 km covered in just 58 days and 14 hours of pure moving time.
So now that I have become a data analyst, it's time to analyze this data with the power of Python!
But simply analyzing a dataset that I have literally felt with my own body would be quite boring for me.
Why not combine this analysis with weather data for the days I rode?
I don’t cycle in sub-zero temperatures, which are common in Estonia from November to March.
So I was glad to be able to spend the winter of 2021/2022 in Cyprus, where I enjoyed the warmth and the cycling. When traveling, I also periodically rented a bike and rode in different places around the world.
Different locations mean different weather, and it took some time to find a source of historical weather data.
It was not easy to get this data completely and for free, but I did it!
Visual Crossing allows you to download 1000 lines of data per day for free. Therefore, I abandoned the idea of downloading weather data hourly and in my dataset one entry corresponds to one day.
You can download historical weather data for any city here, just enter your location and select the dates.
So I changed locations and dates many times, downloaded the individual files as Excel and put them together into one table.
Yes, there is an option to download CSV, and yes, I know how to combine such files into one using Python or R.
But I preferred to merge them all manually into one xlsx file, check for duplicates (every date is unique) and confirm the integrity (each location corresponds to the correct cycling area) using Excel.
In the end, Excel is also an analytical tool, and sometimes working with it is faster and more efficient than fancy programming languages :D
Anyway, I have two files and a great desire to do a high-quality analysis of my data!
This analysis is very personal and very important to me.
I'm excited and looking forward to discovering insights that will help me better understand my hobby and maybe even learn something new about the weather and how it affects my cycling.
Feel free to leave a comment or even upvote my project!
Preparing and Processing Data ¶
For this project I will need NumPy, Pandas, Plotly, Seaborn and Matplotlib libraries:
import numpy as np
import pandas as pd
import datetime
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.cm as cm
from matplotlib import pyplot as plt
from math import radians
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Here is my cycling dataframe:
df = pd.read_excel('/kaggle/input/my-rides-2015-2023/JR data 2015-2023.xlsx', index_col='Kord')
df.head()
| Kuupäev | Aeg | Kaugus | KK | Riik | JR | Nädal | Päev | Sekundid | |
|---|---|---|---|---|---|---|---|---|---|
| Kord | |||||||||
| 1 | 2015-05-01 | 02:03:24 | 39.89 | 19.40 | Eesti | H | 18 | R | 7404 |
| 2 | 2015-05-20 | 00:21:23 | 5.10 | 14.31 | Eesti | H | 21 | K | 1283 |
| 3 | 2015-05-30 | 02:07:44 | 41.30 | 19.40 | Eesti | H | 22 | L | 7664 |
| 4 | 2015-06-05 | 02:21:43 | 40.70 | 17.23 | Eesti | H | 23 | R | 8503 |
| 5 | 2015-06-22 | 02:35:16 | 46.90 | 18.12 | Eesti | H | 26 | E | 9316 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 449 entries, 1 to 449
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Kuupäev   449 non-null    datetime64[ns]
 1   Aeg       449 non-null    object
 2   Kaugus    449 non-null    float64
 3   KK        449 non-null    float64
 4   Riik      449 non-null    object
 5   JR        449 non-null    object
 6   Nädal     449 non-null    int64
 7   Päev      449 non-null    object
 8   Sekundid  449 non-null    int64
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 35.1+ KB
And here is weather data:
weather = pd.read_excel('/kaggle/input/weather-data-mostly-in-tallinn/Weather 2015-04-30 to 2023-11-10.xlsx')
weather.head()
| name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | ... | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tallinn | 2015-04-30 | 11.0 | 3.0 | 6.4 | 11.0 | -1.9 | 3.4 | 2.1 | 75.2 | ... | 18.4 | 7 | NaN | 2015-04-30T05:25:39 | 2015-04-30T21:12:32 | 0.38 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 26135099999,26038099999,EETN,26029099999,26134... |
| 1 | Tallinn | 2015-05-01 | 16.0 | 5.1 | 10.6 | 16.0 | 1.9 | 9.3 | 0.6 | 53.3 | ... | 9.2 | 3 | NaN | 2015-05-01T05:23:00 | 2015-05-01T21:14:57 | 0.41 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 26135099999,26038099999,EETN,26029099999,26134... |
| 2 | Tallinn | 2015-05-02 | 9.0 | 4.0 | 6.6 | 7.0 | 1.0 | 4.0 | 4.5 | 87.1 | ... | 11.8 | 5 | NaN | 2015-05-02T05:20:22 | 2015-05-02T21:17:21 | 0.44 | Rain, Overcast | Cloudy skies throughout the day with rain. | rain | 26135099999,26038099999,EETN,26029099999,26134... |
| 3 | Tallinn | 2015-05-03 | 10.0 | 4.0 | 6.8 | 10.0 | 0.6 | 4.1 | 2.7 | 76.7 | ... | 17.6 | 7 | NaN | 2015-05-03T05:17:45 | 2015-05-03T21:19:45 | 0.47 | Rain, Partially cloudy | Partly cloudy throughout the day with rain. | rain | 26135099999,26038099999,EETN,26029099999,26134... |
| 4 | Tallinn | 2015-05-04 | 10.0 | -0.9 | 6.5 | 10.0 | -2.3 | 5.2 | -1.8 | 59.9 | ... | 18.7 | 6 | NaN | 2015-05-04T05:15:09 | 2015-05-04T21:22:09 | 0.50 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 26135099999,26038099999,EETN,26029099999,26134... |
5 rows × 33 columns
weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1986 entries, 0 to 1985
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              1986 non-null   object
 1   datetime          1986 non-null   object
 2   tempmax           1986 non-null   float64
 3   tempmin           1986 non-null   float64
 4   temp              1986 non-null   float64
 5   feelslikemax      1986 non-null   float64
 6   feelslikemin      1986 non-null   float64
 7   feelslike         1986 non-null   float64
 8   dew               1986 non-null   float64
 9   humidity          1986 non-null   float64
 10  precip            1986 non-null   float64
 11  precipprob        1986 non-null   int64
 12  precipcover       1986 non-null   float64
 13  preciptype        1215 non-null   object
 14  snow              1946 non-null   float64
 15  snowdepth         1921 non-null   float64
 16  windgust          1857 non-null   float64
 17  windspeed         1986 non-null   float64
 18  winddir           1986 non-null   float64
 19  sealevelpressure  1986 non-null   float64
 20  cloudcover        1986 non-null   float64
 21  visibility        1986 non-null   float64
 22  solarradiation    1977 non-null   float64
 23  solarenergy       1977 non-null   float64
 24  uvindex           1986 non-null   int64
 25  severerisk        87 non-null     float64
 26  sunrise           1986 non-null   object
 27  sunset            1986 non-null   object
 28  moonphase         1986 non-null   float64
 29  conditions        1986 non-null   object
 30  description       1986 non-null   object
 31  icon              1986 non-null   object
 32  stations          1986 non-null   object
dtypes: float64(22), int64(2), object(9)
memory usage: 512.1+ KB
Dataframes Overview ¶
Let's look at both dataframes:
First, the cycling dataframe 'JR data 2015-2023.xlsx', which I have filled out myself all these years.
A bit of Estonian and an explanation of what my notes and abbreviations in the dataframe mean:
- Column 'Kord': can be translated as 'No.'. I will use it as the index column.
- Column 'Kuupäev': date, originally in dd.mm.yyyy format, but I see that Pandas immediately reads it as yyyy-mm-dd. Well, that is even more convenient: now the dates are in the same format in both files.
- Column 'Aeg': time in hh:mm:ss format, meaning moving time as in Strava. Before Strava, other applications sometimes calculated this parameter differently (essentially as elapsed time), but for simplicity I will assume it is always moving time.
- Column 'Kaugus': distance in km as float with two decimal places.
- Column 'KK': "keskmine kiirus" means "average speed" in km/h as float with two decimal places.
- Column 'Riik': the country in which the ride took place, with variants:
- 'AÜE' – UAE (United Arab Emirates)
- 'Austria'
- 'Eesti' – Estonia
- 'Itaalia' and 'Jaapan' (you can guess both)
- 'Küpros' – Cyprus
- 'Poola' – Poland
- 'Soome' – Finland
- 'USA'
- Column 'JR': "jalgratas" means "bicycle"; the bike on which the ride was made, with variants:
- 'C' – City (the only ride on a purchased bike that immediately broke down, long story :D and another random ride on the same type of borrowed bike. Both were used in Cyprus)
- 'E' – rented Electric mountain bike
- 'G' – Gorny-Mountain Muddyfox Energy 26 own steel bike, which I rode in Cyprus
- 'H' – Hübriid-Hybrid Scott Sportster 55 own aluminium bike
- 'M' – Maantee-Road Scott USA AFD Comp own aluminium bike
- 'R' – Rented any non-electric bike from the rental service
- 'T' – Titanium Van Nicholas Skeiron own titanium road bike
- Column 'Nädal': week of the year as number.
- Column 'Päev': day, meaning the day of the week, with variants:
- 'E' – Esmaspäev-Monday
- 'T' – Teisipäev-Tuesday
- 'K' – Kolmapäev-Wednesday
- 'N' – Neljapäev-Thursday
- 'R' – Reede-Friday
- 'L' – Laupäev-Saturday
- 'P' – Pühapäev-Sunday
- Column 'Sekundid': duration of the ride in seconds as an integer (I use it to accurately calculate the average speed); it corresponds to the 'Aeg' column.
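Since both 'KK' and 'Sekundid' are recorded, the average speed can be recomputed from distance and moving time as a consistency check. A minimal sketch (the sample values come from the first rows of the table above; the helper name is mine):

```python
# Average speed in km/h from distance (km) and moving time (seconds).
def avg_speed(distance_km: float, seconds: int) -> float:
    return round(distance_km / (seconds / 3600), 2)

# Values from ride #1 and ride #2 above:
print(avg_speed(39.89, 7404))  # -> 19.4, matching 'KK' 19.40
print(avg_speed(5.10, 1283))   # -> 14.31, matching 'KK' 14.31
```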
Second, the weather dataframe 'Weather 2015-04-30 to 2023-11-10.xlsx', which I assembled from 22 separate files in Excel.
Which weather variables am I interested in analyzing?
Well, I care about temperature, humidity, precipitation, wind, atmospheric pressure, cloud cover, UV index, day length and even the phase of the moon. All of this is in the weather dataset, but there is also a lot of unnecessary stuff.
Here I will describe only those columns that I will use in my analysis:
- Column 'name': this is actually the location to which the weather data corresponds.
- Column 'datetime': date in format yyyy-mm-dd.
- Column 'tempmax': maximum daily air temperature in Celsius degrees °C.
- Column 'tempmin': minimum daily air temperature in Celsius degrees °C.
- Column 'temp': average daily air temperature in Celsius degrees °C.
- Column 'humidity': air humidity in %.
- Column 'precip': precipitation (almost always rain, very rarely sleet) in mm.
- Column 'windspeed': wind speed in km/h. I am used to measuring this in m/s, but here it will be convenient to compare the wind speed with my average speed, which is also in km/h.
- Column 'winddir': meteorological wind direction indicates the azimuth of the point from which the wind is blowing, where 0° means north and 90° means east.
- Column 'sealevelpressure': atmospheric pressure at sea level in hectopascals.
- Column 'cloudcover': cloudiness in %.
- Column 'uvindex': UV index according to a known standard.
- Column 'sunrise': timestamp of sunrise by local time.
- Column 'sunset': timestamp of sunset by local time. I will use both of these columns to calculate the length of daylight hours.
- Column 'moonphase': lunar phase as a fraction between 0 and 1, where 0 means new moon and 0.5 means full moon.
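As noted for 'windspeed', the dataset uses km/h while I am used to m/s. A tiny hypothetical helper for translating between the two units (1 m/s = 3.6 km/h):

```python
def kmh_to_ms(kmh: float) -> float:
    """Convert a speed from km/h to m/s."""
    return round(kmh / 3.6, 1)

print(kmh_to_ms(23.9))  # -> 6.6 m/s (the 'windspeed' of the first weather row above)
print(kmh_to_ms(36.0))  # -> 10.0 m/s
```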
Data Processing ¶
Weather Dataset¶
I think it will be more convenient to create a cleaned weather dataframe that contains only what is needed.
First, let's remove all unnecessary dates. I need the weather only for the days when I rode (I have already checked in Excel that the locations correspond to these days).
At the moment the 'datetime' column in 'weather' is object; let's change it to datetime64 and remove all unnecessary dates:
weather['datetime'] = pd.to_datetime(weather['datetime'])
weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1986 entries, 0 to 1985
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              1986 non-null   object
 1   datetime          1986 non-null   datetime64[ns]
 2   tempmax           1986 non-null   float64
 3   tempmin           1986 non-null   float64
 4   temp              1986 non-null   float64
 5   feelslikemax      1986 non-null   float64
 6   feelslikemin      1986 non-null   float64
 7   feelslike         1986 non-null   float64
 8   dew               1986 non-null   float64
 9   humidity          1986 non-null   float64
 10  precip            1986 non-null   float64
 11  precipprob        1986 non-null   int64
 12  precipcover       1986 non-null   float64
 13  preciptype        1215 non-null   object
 14  snow              1946 non-null   float64
 15  snowdepth         1921 non-null   float64
 16  windgust          1857 non-null   float64
 17  windspeed         1986 non-null   float64
 18  winddir           1986 non-null   float64
 19  sealevelpressure  1986 non-null   float64
 20  cloudcover        1986 non-null   float64
 21  visibility        1986 non-null   float64
 22  solarradiation    1977 non-null   float64
 23  solarenergy       1977 non-null   float64
 24  uvindex           1986 non-null   int64
 25  severerisk        87 non-null     float64
 26  sunrise           1986 non-null   object
 27  sunset            1986 non-null   object
 28  moonphase         1986 non-null   float64
 29  conditions        1986 non-null   object
 30  description       1986 non-null   object
 31  icon              1986 non-null   object
 32  stations          1986 non-null   object
dtypes: datetime64[ns](1), float64(22), int64(2), object(8)
memory usage: 512.1+ KB
Now I will create a list with the dates I need.
I will find them in the cycling dataset:
ride_dates = list(df['Kuupäev'])
ride_dates = pd.to_datetime(ride_dates)
ride_dates
DatetimeIndex(['2015-05-01', '2015-05-20', '2015-05-30', '2015-06-05',
'2015-06-22', '2015-06-28', '2015-06-29', '2015-07-05',
'2015-07-25', '2015-07-28',
...
'2023-09-25', '2023-09-26', '2023-09-28', '2023-10-04',
'2023-10-10', '2023-10-13', '2023-10-24', '2023-11-05',
'2023-11-07', '2023-11-09'],
dtype='datetime64[ns]', length=449, freq=None)
I create a new dataset with only the dates I need:
ride_weather = weather[weather['datetime'].isin(ride_dates)]
ride_weather.head()
| name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | ... | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tallinn | 2015-05-01 | 16.0 | 5.1 | 10.6 | 16.0 | 1.9 | 9.3 | 0.6 | 53.3 | ... | 9.2 | 3 | NaN | 2015-05-01T05:23:00 | 2015-05-01T21:14:57 | 0.41 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 26135099999,26038099999,EETN,26029099999,26134... |
| 20 | Tallinn | 2015-05-20 | 16.0 | 9.1 | 12.3 | 16.0 | 7.3 | 12.0 | 6.3 | 68.3 | ... | 18.6 | 6 | NaN | 2015-05-20T04:37:33 | 2015-05-20T21:59:00 | 0.07 | Rain, Partially cloudy | Partly cloudy throughout the day with early mo... | rain | 26135099999,26038099999,EETN,26029099999,26134... |
| 30 | Tallinn | 2015-05-30 | 19.0 | 3.1 | 11.8 | 19.0 | 1.0 | 11.5 | 4.8 | 66.8 | ... | 25.3 | 7 | NaN | 2015-05-30T04:19:21 | 2015-05-30T22:18:53 | 0.39 | Rain, Partially cloudy | Becoming cloudy in the afternoon with late aft... | rain | 26135099999,26038099999,EETN,26029099999,26134... |
| 36 | Tallinn | 2015-06-05 | 17.0 | 9.0 | 12.9 | 17.0 | 6.6 | 12.4 | 7.1 | 70.0 | ... | 27.7 | 8 | NaN | 2015-06-05T04:11:14 | 2015-06-05T22:28:37 | 0.60 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 26135099999,26038099999,EETN,26029099999,26134... |
| 53 | Tallinn | 2015-06-22 | 16.0 | 12.0 | 14.1 | 16.0 | 12.0 | 14.1 | 11.3 | 84.1 | ... | 20.6 | 7 | NaN | 2015-06-22T04:03:14 | 2015-06-22T22:42:33 | 0.19 | Rain, Partially cloudy | Partly cloudy throughout the day with rain cle... | rain | 26135099999,26038099999,EETN,26029099999,26134... |
5 rows × 33 columns
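The isin() filter above keeps only ride days. An equivalent approach is an inner merge on the date column; a minimal sketch with hypothetical toy frames standing in for 'weather' and the ride dates:

```python
import pandas as pd

# Toy frames (hypothetical sample values):
weather_demo = pd.DataFrame({
    'datetime': pd.to_datetime(['2015-05-01', '2015-05-02', '2015-05-20']),
    'temp': [10.6, 6.6, 12.3],
})
rides_demo = pd.DataFrame({'Kuupäev': pd.to_datetime(['2015-05-01', '2015-05-20'])})

# An inner merge keeps exactly the weather rows whose date appears among the rides,
# equivalent here to weather_demo[weather_demo['datetime'].isin(rides_demo['Kuupäev'])].
merged = weather_demo.merge(rides_demo, left_on='datetime', right_on='Kuupäev')
print(merged['temp'].tolist())  # -> [10.6, 12.3]
```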
Now let's drop all the unnecessary columns.
I don't need 18 of the 33 columns, and I also need to reset the index:
rw = ride_weather.drop(columns=['feelslikemax', 'feelslikemin', 'feelslike',
                                'dew', 'precipprob', 'precipcover', 'preciptype',
                                'snow', 'snowdepth', 'windgust', 'visibility',
                                'solarradiation', 'solarenergy', 'severerisk',
                                'conditions', 'description', 'icon', 'stations'])
rw = rw.reset_index(drop=True)  # assign the result back, otherwise the old index is kept
rw
| name | datetime | tempmax | tempmin | temp | humidity | precip | windspeed | winddir | sealevelpressure | cloudcover | uvindex | sunrise | sunset | moonphase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tallinn | 2015-05-01 | 16.0 | 5.1 | 10.6 | 53.3 | 0.000 | 23.9 | 124.7 | 1012.0 | 56.8 | 3 | 2015-05-01T05:23:00 | 2015-05-01T21:14:57 | 0.41 |
| 1 | Tallinn | 2015-05-20 | 16.0 | 9.1 | 12.3 | 68.3 | 0.201 | 22.0 | 255.3 | 1011.6 | 64.4 | 6 | 2015-05-20T04:37:33 | 2015-05-20T21:59:00 | 0.07 |
| 2 | Tallinn | 2015-05-30 | 19.0 | 3.1 | 11.8 | 66.8 | 0.399 | 23.9 | 125.2 | 1010.3 | 32.5 | 7 | 2015-05-30T04:19:21 | 2015-05-30T22:18:53 | 0.39 |
| 3 | Tallinn | 2015-06-05 | 17.0 | 9.0 | 12.9 | 70.0 | 0.000 | 34.8 | 261.2 | 1021.9 | 38.9 | 8 | 2015-06-05T04:11:14 | 2015-06-05T22:28:37 | 0.60 |
| 4 | Tallinn | 2015-06-22 | 16.0 | 12.0 | 14.1 | 84.1 | 4.171 | 14.5 | 294.5 | 1009.9 | 68.6 | 7 | 2015-06-22T04:03:14 | 2015-06-22T22:42:33 | 0.19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 444 | Tallinn | 2023-10-13 | 13.0 | 6.1 | 9.1 | 76.4 | 0.011 | 30.5 | 238.8 | 1002.8 | 49.7 | 4 | 2023-10-13T07:53:17 | 2023-10-13T18:20:15 | 0.96 |
| 445 | Tallinn | 2023-10-24 | 6.0 | -0.9 | 3.3 | 85.8 | 0.622 | 14.8 | 84.0 | 1015.5 | 79.3 | 2 | 2023-10-24T08:19:57 | 2023-10-24T17:49:29 | 0.34 |
| 446 | Tallinn | 2023-11-05 | 9.0 | 5.0 | 6.8 | 93.7 | 0.119 | 16.5 | 138.4 | 991.4 | 80.1 | 1 | 2023-11-05T07:49:48 | 2023-11-05T16:18:27 | 0.75 |
| 447 | Tallinn | 2023-11-07 | 8.0 | 7.0 | 7.6 | 95.3 | 0.660 | 16.5 | 233.2 | 998.2 | 92.7 | 1 | 2023-11-07T07:54:48 | 2023-11-07T16:13:39 | 0.81 |
| 448 | Tallinn | 2023-11-09 | 8.0 | 5.0 | 6.4 | 96.2 | 0.091 | 14.6 | 190.0 | 1005.1 | 96.9 | 1 | 2023-11-09T07:59:48 | 2023-11-09T16:08:57 | 0.87 |
449 rows × 15 columns
Also, I don't like the column name 'name'; let it be 'location':
rw = rw.rename(columns={'name': 'location'})
rw.info()
<class 'pandas.core.frame.DataFrame'>
Index: 449 entries, 1 to 1984
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   location          449 non-null    object
 1   datetime          449 non-null    datetime64[ns]
 2   tempmax           449 non-null    float64
 3   tempmin           449 non-null    float64
 4   temp              449 non-null    float64
 5   humidity          449 non-null    float64
 6   precip            449 non-null    float64
 7   windspeed         449 non-null    float64
 8   winddir           449 non-null    float64
 9   sealevelpressure  449 non-null    float64
 10  cloudcover        449 non-null    float64
 11  uvindex           449 non-null    int64
 12  sunrise           449 non-null    object
 13  sunset            449 non-null    object
 14  moonphase         449 non-null    float64
dtypes: datetime64[ns](1), float64(10), int64(1), object(3)
memory usage: 56.1+ KB
Now let's calculate daylight time from 'sunrise' and 'sunset' columns.
I will change their data type from object to datetime64[ns] first:
rw['sunrise'] = pd.to_datetime(rw['sunrise'])
rw['sunset'] = pd.to_datetime(rw['sunset'])
Then I will do the calculation:
rw['daylength'] = (rw['sunset'] - rw['sunrise'])
Finally I will convert this time into seconds in new column:
rw['daysec'] = rw['daylength'] / np.timedelta64(1, 's')
rw.head()
| location | datetime | tempmax | tempmin | temp | humidity | precip | windspeed | winddir | sealevelpressure | cloudcover | uvindex | sunrise | sunset | moonphase | daylength | daysec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tallinn | 2015-05-01 | 16.0 | 5.1 | 10.6 | 53.3 | 0.000 | 23.9 | 124.7 | 1012.0 | 56.8 | 3 | 2015-05-01 05:23:00 | 2015-05-01 21:14:57 | 0.41 | 0 days 15:51:57 | 57117.0 |
| 20 | Tallinn | 2015-05-20 | 16.0 | 9.1 | 12.3 | 68.3 | 0.201 | 22.0 | 255.3 | 1011.6 | 64.4 | 6 | 2015-05-20 04:37:33 | 2015-05-20 21:59:00 | 0.07 | 0 days 17:21:27 | 62487.0 |
| 30 | Tallinn | 2015-05-30 | 19.0 | 3.1 | 11.8 | 66.8 | 0.399 | 23.9 | 125.2 | 1010.3 | 32.5 | 7 | 2015-05-30 04:19:21 | 2015-05-30 22:18:53 | 0.39 | 0 days 17:59:32 | 64772.0 |
| 36 | Tallinn | 2015-06-05 | 17.0 | 9.0 | 12.9 | 70.0 | 0.000 | 34.8 | 261.2 | 1021.9 | 38.9 | 8 | 2015-06-05 04:11:14 | 2015-06-05 22:28:37 | 0.60 | 0 days 18:17:23 | 65843.0 |
| 53 | Tallinn | 2015-06-22 | 16.0 | 12.0 | 14.1 | 84.1 | 4.171 | 14.5 | 294.5 | 1009.9 | 68.6 | 7 | 2015-06-22 04:03:14 | 2015-06-22 22:42:33 | 0.19 | 0 days 18:39:19 | 67159.0 |
Now let's convert the 'daylength' column to string and remove the useless '0 days ' prefix:
rw["daylength"] = rw["daylength"].astype(str)
rw["daylength"] = rw["daylength"].str.removeprefix('0 days ')
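For reference, pandas Timedelta values also expose these pieces directly, which avoids the string round-trip; a small sketch using the first row's sunrise/sunset times from the table above:

```python
import pandas as pd

# The first ride's daylight interval (2015-05-01, sunrise to sunset):
s = pd.Series([pd.Timestamp('2015-05-01 21:14:57') - pd.Timestamp('2015-05-01 05:23:00')])

# .dt.total_seconds() is equivalent to dividing by np.timedelta64(1, 's'):
print(s.dt.total_seconds().iloc[0])  # -> 57117.0

# An 'HH:MM:SS' label without astype(str) + removeprefix
# (daylight is always under 24 h, so the days component can be ignored):
comp = s.dt.components
label = comp.apply(lambda c: f"{c.hours:02d}:{c.minutes:02d}:{c.seconds:02d}", axis=1)
print(label.iloc[0])  # -> 15:51:57
```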
To build plots grouped by different time categories, I will also need 'year', 'month' and 'season' columns:
rw['year'] = pd.DatetimeIndex(rw['datetime']).year
rw['month'] = pd.DatetimeIndex(rw['datetime']).month
mnt = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
rw['month'] = rw['month'].replace(mnt)
ys = {'Jan': 'Winter', 'Feb': 'Winter', 'Mar': 'Spring', 'Apr': 'Spring',
'May': 'Spring', 'Jun': 'Summer', 'Jul': 'Summer', 'Aug': 'Summer',
'Sep': 'Autumn', 'Oct': 'Autumn', 'Nov': 'Autumn', 'Dec': 'Winter'}
rw['season'] = rw['month'].replace(ys)
rw.head()
| location | datetime | tempmax | tempmin | temp | humidity | precip | windspeed | winddir | sealevelpressure | cloudcover | uvindex | sunrise | sunset | moonphase | daylength | daysec | year | month | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tallinn | 2015-05-01 | 16.0 | 5.1 | 10.6 | 53.3 | 0.000 | 23.9 | 124.7 | 1012.0 | 56.8 | 3 | 2015-05-01 05:23:00 | 2015-05-01 21:14:57 | 0.41 | 15:51:57 | 57117.0 | 2015 | May | Spring |
| 20 | Tallinn | 2015-05-20 | 16.0 | 9.1 | 12.3 | 68.3 | 0.201 | 22.0 | 255.3 | 1011.6 | 64.4 | 6 | 2015-05-20 04:37:33 | 2015-05-20 21:59:00 | 0.07 | 17:21:27 | 62487.0 | 2015 | May | Spring |
| 30 | Tallinn | 2015-05-30 | 19.0 | 3.1 | 11.8 | 66.8 | 0.399 | 23.9 | 125.2 | 1010.3 | 32.5 | 7 | 2015-05-30 04:19:21 | 2015-05-30 22:18:53 | 0.39 | 17:59:32 | 64772.0 | 2015 | May | Spring |
| 36 | Tallinn | 2015-06-05 | 17.0 | 9.0 | 12.9 | 70.0 | 0.000 | 34.8 | 261.2 | 1021.9 | 38.9 | 8 | 2015-06-05 04:11:14 | 2015-06-05 22:28:37 | 0.60 | 18:17:23 | 65843.0 | 2015 | Jun | Summer |
| 53 | Tallinn | 2015-06-22 | 16.0 | 12.0 | 14.1 | 84.1 | 4.171 | 14.5 | 294.5 | 1009.9 | 68.6 | 7 | 2015-06-22 04:03:14 | 2015-06-22 22:42:33 | 0.19 | 18:39:19 | 67159.0 | 2015 | Jun | Summer |
Okay, finally all the required columns are there and have the correct data types.
After everything is done, I save the result in Excel:
rw.to_excel('rw.xlsx', index=None, sheet_name='Weather')
Cycling Dataset¶
For the sake of consistency, I will also rename the columns there:
df = df.rename(columns={'Kuupäev': 'date', 'Aeg': 'duration', 'Kaugus': 'distance',
'KK': 'avgspeed', 'Riik': 'country', 'JR': 'bike',
'Nädal': 'week', 'Päev': 'dayoftheweek', 'Sekundid': 'seconds'})
And I will translate all the words and abbreviations in my dataframe into English.
To do this, I create dictionaries and use the replace() method:
riik = {'AÜE': 'UAE', 'Eesti': 'Estonia', 'Itaalia': 'Italy', 'Jaapan': 'Japan',
'Küpros': 'Cyprus', 'Poola': 'Poland', 'Soome': 'Finland'}
jr = {'C': 'City', 'E': 'E-MTB', 'G': 'MTB St', 'H': 'Hybrid Al',
'M': 'Road Al', 'R': 'Rented', 'T': 'Road Ti'}
days = {'E': 'MON', 'T': 'TUE', 'K': 'WED', 'N': 'THU', 'R': 'FRI',
'L': 'SAT', 'P': 'SUN'}
df['country'] = df['country'].replace(riik)
df['bike'] = df['bike'].replace(jr)
df['dayoftheweek'] = df['dayoftheweek'].replace(days)
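Note that 'USA' and 'Austria' are absent from the riik dictionary because they are already English, and replace() leaves unmapped values unchanged; map() would instead turn them into NaN, which is why replace() is the right tool here. A minimal illustration:

```python
import pandas as pd

s = pd.Series(['Eesti', 'USA'])
translate = {'Eesti': 'Estonia'}

print(s.replace(translate).tolist())  # -> ['Estonia', 'USA']  (unknown values kept)
print(s.map(translate).tolist())      # -> ['Estonia', nan]    (unknown values lost)
```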
OK, right now I need to reboot my computer, so I will save the cleaned and modified dataframe to an Excel file:
df.to_excel('df.xlsx', index=None, sheet_name='Rides')
Concatenation of the Dataframes ¶
After the restart, some Windows updates and a cup of tea, let's read the xlsx files back:
df = pd.read_excel('df.xlsx')
df.head()
| date | duration | distance | avgspeed | country | bike | week | dayoftheweek | seconds | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-01 | 02:03:24 | 39.89 | 19.40 | Estonia | Hybrid Al | 18 | FRI | 7404 |
| 1 | 2015-05-20 | 00:21:23 | 5.10 | 14.31 | Estonia | Hybrid Al | 21 | WED | 1283 |
| 2 | 2015-05-30 | 02:07:44 | 41.30 | 19.40 | Estonia | Hybrid Al | 22 | SAT | 7664 |
| 3 | 2015-06-05 | 02:21:43 | 40.70 | 17.23 | Estonia | Hybrid Al | 23 | FRI | 8503 |
| 4 | 2015-06-22 | 02:35:16 | 46.90 | 18.12 | Estonia | Hybrid Al | 26 | MON | 9316 |
rw = pd.read_excel('rw.xlsx')
rw.head()
| location | datetime | tempmax | tempmin | temp | humidity | precip | windspeed | winddir | sealevelpressure | cloudcover | uvindex | sunrise | sunset | moonphase | daylength | daysec | year | month | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tallinn | 2015-05-01 | 16.0 | 5.1 | 10.6 | 53.3 | 0.000 | 23.9 | 124.7 | 1012.0 | 56.8 | 3 | 2015-05-01 05:23:00 | 2015-05-01 21:14:57 | 0.41 | 15:51:57 | 57117 | 2015 | May | Spring |
| 1 | Tallinn | 2015-05-20 | 16.0 | 9.1 | 12.3 | 68.3 | 0.201 | 22.0 | 255.3 | 1011.6 | 64.4 | 6 | 2015-05-20 04:37:33 | 2015-05-20 21:59:00 | 0.07 | 17:21:27 | 62487 | 2015 | May | Spring |
| 2 | Tallinn | 2015-05-30 | 19.0 | 3.1 | 11.8 | 66.8 | 0.399 | 23.9 | 125.2 | 1010.3 | 32.5 | 7 | 2015-05-30 04:19:21 | 2015-05-30 22:18:53 | 0.39 | 17:59:32 | 64772 | 2015 | May | Spring |
| 3 | Tallinn | 2015-06-05 | 17.0 | 9.0 | 12.9 | 70.0 | 0.000 | 34.8 | 261.2 | 1021.9 | 38.9 | 8 | 2015-06-05 04:11:14 | 2015-06-05 22:28:37 | 0.60 | 18:17:23 | 65843 | 2015 | Jun | Summer |
| 4 | Tallinn | 2015-06-22 | 16.0 | 12.0 | 14.1 | 84.1 | 4.171 | 14.5 | 294.5 | 1009.9 | 68.6 | 7 | 2015-06-22 04:03:14 | 2015-06-22 22:42:33 | 0.19 | 18:39:19 | 67159 | 2015 | Jun | Summer |
Seems legit :)
Now I can concatenate them into one file for further analysis:
df_concat = pd.concat([df, rw], axis=1)
These columns are not needed anymore:
df = df_concat.drop(columns=['datetime', 'sunrise', 'sunset'])
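Worth remembering: pd.concat with axis=1 pairs rows purely by index, not by date. Both frames here hold the same 449 days in the same order, so this works, but an assertion makes the assumption explicit; a sketch with hypothetical toy frames standing in for df and rw:

```python
import pandas as pd

# Toy frames (hypothetical sample rows):
rides_demo = pd.DataFrame({'date': pd.to_datetime(['2015-05-01', '2015-05-20']),
                           'distance': [39.89, 5.10]})
weather_demo = pd.DataFrame({'datetime': pd.to_datetime(['2015-05-01', '2015-05-20']),
                             'temp': [10.6, 12.3]})

# Confirm row-by-row date alignment before concatenating positionally:
assert (rides_demo['date'] == weather_demo['datetime']).all()
combined = pd.concat([rides_demo, weather_demo], axis=1).drop(columns=['datetime'])
print(list(combined.columns))  # -> ['date', 'distance', 'temp']
```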
Hooray! Now we have a united file!
Let's also save it as xlsx to have a backup in case something goes wrong while manipulating the data or with the PC:
df.to_excel('JR_ilm.xlsx', index=None, sheet_name='JR_ilm')
df = pd.read_excel('JR_ilm.xlsx')
df.head()
| date | duration | distance | avgspeed | country | bike | week | dayoftheweek | seconds | location | ... | winddir | sealevelpressure | cloudcover | uvindex | moonphase | daylength | daysec | year | month | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-01 | 02:03:24 | 39.89 | 19.40 | Estonia | Hybrid Al | 18 | FRI | 7404 | Tallinn | ... | 124.7 | 1012.0 | 56.8 | 3 | 0.41 | 15:51:57 | 57117 | 2015 | May | Spring |
| 1 | 2015-05-20 | 00:21:23 | 5.10 | 14.31 | Estonia | Hybrid Al | 21 | WED | 1283 | Tallinn | ... | 255.3 | 1011.6 | 64.4 | 6 | 0.07 | 17:21:27 | 62487 | 2015 | May | Spring |
| 2 | 2015-05-30 | 02:07:44 | 41.30 | 19.40 | Estonia | Hybrid Al | 22 | SAT | 7664 | Tallinn | ... | 125.2 | 1010.3 | 32.5 | 7 | 0.39 | 17:59:32 | 64772 | 2015 | May | Spring |
| 3 | 2015-06-05 | 02:21:43 | 40.70 | 17.23 | Estonia | Hybrid Al | 23 | FRI | 8503 | Tallinn | ... | 261.2 | 1021.9 | 38.9 | 8 | 0.60 | 18:17:23 | 65843 | 2015 | Jun | Summer |
| 4 | 2015-06-22 | 02:35:16 | 46.90 | 18.12 | Estonia | Hybrid Al | 26 | MON | 9316 | Tallinn | ... | 294.5 | 1009.9 | 68.6 | 7 | 0.19 | 18:39:19 | 67159 | 2015 | Jun | Summer |
5 rows × 26 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 449 entries, 0 to 448
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   date              449 non-null    datetime64[ns]
 1   duration          449 non-null    object
 2   distance          449 non-null    float64
 3   avgspeed          449 non-null    float64
 4   country           449 non-null    object
 5   bike              449 non-null    object
 6   week              449 non-null    int64
 7   dayoftheweek      449 non-null    object
 8   seconds           449 non-null    int64
 9   location          449 non-null    object
 10  tempmax           449 non-null    float64
 11  tempmin           449 non-null    float64
 12  temp              449 non-null    float64
 13  humidity          449 non-null    float64
 14  precip            449 non-null    float64
 15  windspeed         449 non-null    float64
 16  winddir           449 non-null    float64
 17  sealevelpressure  449 non-null    float64
 18  cloudcover        449 non-null    float64
 19  uvindex           449 non-null    int64
 20  moonphase         449 non-null    float64
 21  daylength         449 non-null    object
 22  daysec            449 non-null    int64
 23  year              449 non-null    int64
 24  month             449 non-null    object
 25  season            449 non-null    object
dtypes: datetime64[ns](1), float64(12), int64(5), object(8)
memory usage: 91.3+ KB
I am fine with the 'duration' and 'daylength' columns staying as objects, because the ride time in seconds is in the 'seconds' column and the day length in seconds is in the 'daysec' column; those will be used for the comparisons.
The values in the 'duration' and 'daylength' columns will be used only for labeling.
Analysis ¶
Hybrid Al – Scott Sportster 55, aluminium, owned from cycling day one until now
Correlation Matrix and Pairplot ¶
To quickly understand how the variables are related to each other, let's compute the correlation matrix of the numeric columns and visualize it:
df_corr = df.corr(numeric_only=True)
df_corr
| distance | avgspeed | week | seconds | tempmax | tempmin | temp | humidity | precip | windspeed | winddir | sealevelpressure | cloudcover | uvindex | moonphase | daysec | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| distance | 1.000000 | 0.601072 | 0.131828 | 0.953935 | 0.080134 | 0.114583 | 0.107378 | 0.084545 | -0.056626 | -0.148793 | 0.072002 | -0.068474 | 0.048125 | -0.039250 | 0.057973 | 0.039371 | 0.427313 |
| avgspeed | 0.601072 | 1.000000 | 0.190391 | 0.385153 | 0.186078 | 0.226835 | 0.224974 | 0.085780 | 0.019644 | -0.093695 | 0.040839 | -0.071066 | -0.007873 | 0.067849 | 0.058080 | 0.269704 | 0.439407 |
| week | 0.131828 | 0.190391 | 1.000000 | 0.086420 | -0.181326 | 0.100647 | -0.076677 | 0.593181 | 0.077152 | -0.033526 | -0.056306 | -0.062784 | 0.379624 | -0.566914 | 0.093840 | -0.474195 | -0.092996 |
| seconds | 0.953935 | 0.385153 | 0.086420 | 1.000000 | 0.008766 | 0.031256 | 0.025254 | 0.066011 | -0.087320 | -0.144831 | 0.078700 | -0.050505 | 0.060589 | -0.083038 | 0.053397 | -0.059565 | 0.374242 |
| tempmax | 0.080134 | 0.186078 | -0.181326 | 0.008766 | 1.000000 | 0.799318 | 0.965142 | -0.297530 | -0.002638 | -0.027130 | -0.095968 | -0.044492 | -0.474493 | 0.625859 | -0.049319 | 0.641350 | -0.119266 |
| tempmin | 0.114583 | 0.226835 | 0.100647 | 0.031256 | 0.799318 | 1.000000 | 0.911034 | 0.069978 | 0.116851 | 0.020589 | -0.020092 | -0.177007 | -0.063892 | 0.340648 | -0.041012 | 0.417439 | -0.080087 |
| temp | 0.107378 | 0.224974 | -0.076677 | 0.025254 | 0.965142 | 0.911034 | 1.000000 | -0.197768 | 0.033721 | -0.014835 | -0.075343 | -0.095278 | -0.348603 | 0.556183 | -0.046899 | 0.602995 | -0.107075 |
| humidity | 0.084545 | 0.085780 | 0.593181 | 0.066011 | -0.297530 | 0.069978 | -0.197768 | 1.000000 | 0.209883 | -0.117599 | 0.018796 | -0.211656 | 0.644904 | -0.620946 | 0.036392 | -0.493344 | -0.050228 |
| precip | -0.056626 | 0.019644 | 0.077152 | -0.087320 | -0.002638 | 0.116851 | 0.033721 | 0.209883 | 1.000000 | 0.040278 | 0.022116 | -0.117257 | 0.170482 | -0.034126 | 0.049733 | -0.027432 | -0.012694 |
| windspeed | -0.148793 | -0.093695 | -0.033526 | -0.144831 | -0.027130 | 0.020589 | -0.014835 | -0.117599 | 0.040278 | 1.000000 | 0.113959 | -0.092456 | -0.045936 | 0.046907 | -0.079349 | -0.033127 | 0.046056 |
| winddir | 0.072002 | 0.040839 | -0.056306 | 0.078700 | -0.095968 | -0.020092 | -0.075343 | 0.018796 | 0.022116 | 0.113959 | 1.000000 | -0.136402 | 0.037741 | 0.041630 | 0.018818 | 0.031111 | 0.090506 |
| sealevelpressure | -0.068474 | -0.071066 | -0.062784 | -0.050505 | -0.044492 | -0.177007 | -0.095278 | -0.211656 | -0.117257 | -0.092456 | -0.136402 | 1.000000 | -0.228686 | 0.063509 | 0.067828 | -0.080196 | -0.045008 |
| cloudcover | 0.048125 | -0.007873 | 0.379624 | 0.060589 | -0.474493 | -0.063892 | -0.348603 | 0.644904 | 0.170482 | -0.045936 | 0.037741 | -0.228686 | 1.000000 | -0.643200 | 0.056400 | -0.490965 | 0.041377 |
| uvindex | -0.039250 | 0.067849 | -0.566914 | -0.083038 | 0.625859 | 0.340648 | 0.556183 | -0.620946 | -0.034126 | 0.046907 | 0.041630 | 0.063509 | -0.643200 | 1.000000 | -0.045713 | 0.808713 | -0.105603 |
| moonphase | 0.057973 | 0.058080 | 0.093840 | 0.053397 | -0.049319 | -0.041012 | -0.046899 | 0.036392 | 0.049733 | -0.079349 | 0.018818 | 0.067828 | 0.056400 | -0.045713 | 1.000000 | -0.030540 | 0.004869 |
| daysec | 0.039371 | 0.269704 | -0.474195 | -0.059565 | 0.641350 | 0.417439 | 0.602995 | -0.493344 | -0.027432 | -0.033127 | 0.031111 | -0.080196 | -0.490965 | 0.808713 | -0.030540 | 1.000000 | -0.125939 |
| year | 0.427313 | 0.439407 | -0.092996 | 0.374242 | -0.119266 | -0.080087 | -0.107075 | -0.050228 | -0.012694 | 0.046056 | 0.090506 | -0.045008 | 0.041377 | -0.105603 | 0.004869 | -0.125939 | 1.000000 |
plt.figure(figsize = (16, 9))
sns.heatmap(df_corr, annot=True, fmt='.2f', cmap='coolwarm', center=0,
vmin=-1, vmax=1)
I prefer a triangular correlation matrix in which combinations are not repeated:
matrix = np.triu(df.corr(numeric_only=True))
plt.figure(figsize = (16, 9))
sns.heatmap(df_corr, annot=True, fmt='.2f', cmap='coolwarm', center=0,
vmin=-1, vmax=1, mask=matrix)
Positive Correlation¶
The correlation matrix shows an obvious positive correlation between every pair of temperature measures.
Outside of temperature, the most strongly positively correlated indicators are 'distance' and ride duration in 'seconds': the greater the distance covered, the longer the ride takes.
Day length in seconds ('daysec'), as expected, has a strong correlation with the UV index (0.81) and the temperature records.
A strong linear relationship (>0.6) is also observed between distance and average speed (0.60), 'tempmax' and UV index (0.63), as well as between 'humidity' and 'cloudcover' (0.64).
I usually do long distance rides on the highway, and the average speed there is always higher than in the case of shorter rides in the city or forest.
Each year I cycled further (0.43), faster (0.44) and longer (0.37), and this is reflected in the positive correlation between 'year' and the relevant variables.
Negative Correlation¶
It’s clear that the UV index becomes lower as cloudiness and humidity increase (-0.65 and -0.62).
In our climate, autumn and winter are characterized by increased cloudiness and humidity, as well as a decrease in daylight hours, which is confirmed here by the negative correlation (-0.49) between day length ('daysec') and these indicators.
There is also a negative relationship between 'tempmax' and 'cloudcover' (-0.47).
Any correlations with the week number can be safely ignored.
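Instead of reading the strongest pairs off the heatmap by eye, they can also be listed programmatically. A sketch, assuming `df_corr` is the correlation DataFrame computed above; `top_corr_pairs` is a helper name of my own:

```python
import numpy as np
import pandas as pd

def top_corr_pairs(df_corr: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n strongest off-diagonal correlations, each pair listed once."""
    # Keep only the strict upper triangle so every pair appears exactly once
    mask = np.triu(np.ones(df_corr.shape, dtype=bool), k=1)
    pairs = df_corr.where(mask).stack()
    # Sort by absolute value so strong negative correlations also surface
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(n)
```

On the full matrix this should put the 'tempmax'/'temp' pair (0.97) at the top, matching the heatmap.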
Pairplot¶
To plot multiple pairwise bivariate distributions in a dataset, I will use the pairplot() function:
import warnings
warnings.filterwarnings('ignore')
pp = sns.pairplot(df, hue='bike', corner=True,
palette={'Hybrid Al': 'green', 'Rented': 'red', 'Road Al': 'blue', 'E-MTB': 'purple',
'City': 'yellow', 'MTB St': 'orange', 'Road Ti': 'black'})
plt.setp(pp._legend.get_title(), fontsize=30)
plt.setp(pp._legend.get_texts(), fontsize=25)
pp.savefig('pairplot.png')
plt.show()
I have enlarged the legend fonts to make it more readable.
The pairplot looks amazing, but since it displays 17 variables at once and consists of 153 subplots, it is still quite difficult to read. So to look at this in detail, I save the output as a png image.
Wow, what a huge warning I got from Kaggle the first time I ran this block of code!
It produced the pairplot anyway, but I don't like that mess and don't want such a big warning in my analysis, so I simply re-ran the block, prefixed with a command to ignore warnings.
As collateral damage, this command disables warnings for all subsequent code blocks as well.
Well, in this case, some outputs will definitely become more compact and beautiful :D
And by the way, there were no warnings at all when I ran the pairplot on my PC.
Cycling Scatter Plots ¶
df['date'] = pd.to_datetime(df['date']).dt.date
px.scatter(data_frame=df,
x=(df.seconds/60/60),
y='distance',
height=500,
color='bike',
hover_data=['date', 'duration', 'avgspeed', 'country'],
color_discrete_sequence=['green', 'red', 'blue', 'purple', 'yellow', 'orange', 'black'],
labels={'x': 'Hours', 'distance': 'Distance km', 'bike': 'Bike', 'date': 'Date',
'duration': 'Duration', 'avgspeed': 'Avg Speed km/h', 'country': 'Country'},
title='My Bike Rides')
This is my main scatterplot with all recorded bike rides. Zoom in and out if you want to observe any of them in more detail.
The hover data consists of general information about the ride. Since I want 'date' to be displayed as yyyy-mm-dd, I formatted it again before building this plot (to remove the '00:00:00' part).
Throughout the analysis I will keep using this color sequence for the bikes, so you always know that Hybrid Al is green and MTB St is orange.
px.scatter_geo(data_frame=df,
locations='country',
locationmode = 'country names',
color='country',
labels={'country': 'Country', 'distance': 'Distance km', 'date': 'Date',
'duration': 'Duration', 'avgspeed': 'Avg Speed km/h'},
hover_data=['date', 'duration', 'avgspeed'],
height=450,
size='distance',
size_max=30,
title='My Bike Rides All Over the World')
The geographical scatter plot shows the countries where the bike rides took place. 9 countries, 3 continents – a world map is necessary here.
Again, feel free to zoom in and out of the map, take advantage of everything we love about Plotly!
These scatterplots will also show how the distance and average speed of my rides change over time:
fig = px.scatter(df,
x='date',
y='distance',
color='bike',
size='distance',
size_max = 20,
hover_data='country',
color_discrete_sequence=['green', 'red', 'blue', 'purple',
'yellow', 'orange', 'black'],
labels={'distance': 'Distance km', 'bike': 'Bike',
'date': 'Date', 'country': 'Country'},
title='Distribution of Bike Rides by Time and Distance')
fig.update_layout(xaxis_title=None)
fig = px.scatter(df,
x='date',
y='avgspeed',
color='bike',
size='avgspeed',
size_max = 20,
hover_data='country',
color_discrete_sequence=['green', 'red', 'blue', 'purple',
'yellow', 'orange', 'black'],
labels={'avgspeed': 'Avg Speed km/h', 'bike': 'Bike',
'date': 'Date', 'country': 'Country'},
title='Distribution of Bike Rides by Time and Average Speed')
fig.update_layout(xaxis_title=None)
While the maximum distance covered in a single ride has grown almost linearly since 2020 in increments of 50 km (such rides are the main event of the cycling year), the maximum average speed in a single ride does not grow steadily from year to year: it has two peaks, in 2021 (28.15 km/h) and 2022 (28.42 km/h).
With the exception of one record-breaking Road Al ride in June 2021, all rides with average speed over 25 km/h were made on Road Ti.
Van Nicholas Skeiron was expensive, but it was worth it – a titanium bike is ideal for long, multi-hour rides of several hundred kilometers.
It is not surprising that all rides with an average speed of less than 10 km/h take place abroad.
Places like Abu Dhabi or the National Mall in Washington, D.C. make you want to look around, take photos and only occasionally pedal to move from one attraction to another.
Road Ti – Van Nicholas Skeiron, titanium, owned since 2022
Cycling Treemap Charts ¶
px.treemap(df,
path=[px.Constant('Rides'), 'country', 'bike'],
height=700,
color='distance',
labels={'distance': 'Avg km'},
color_continuous_scale='rainbow',
hover_name='distance',
title='Distance Treemap Categorized by Country and Bike')
The treemap is categorized by country and bike, showing the average distance in km for each category as 'Avg km'.
px.treemap(df,
path=[px.Constant('Rides'), 'country', 'bike', 'date'],
height=700,
color='distance',
labels={'distance': 'Distance', 'date': 'Date', 'duration': 'Duration'},
color_continuous_scale='Portland',
hover_name='distance',
title='Distance Treemap of Rides Categorized by Country and Bike',
hover_data=['date', 'duration'])
This is the same treemap with an extended path, describing literally every single ride.
Categories still show the average distance and the ride count, while each individual ride shows the exact date, duration and distance in its hover data.
This treemap represents average speed categorized by country and bike:
px.treemap(df,
path=[px.Constant('Rides'), 'country', 'bike'],
height=700,
color='avgspeed',
labels={'avgspeed': 'km/h'},
color_continuous_scale='rainbow',
hover_name='avgspeed',
title='Average Speed Treemap Categorized by Country and Bike')
Treemap charts visualize hierarchical data using nested rectangles, which means they can depict literally every row of my dataset in a perfectly readable form.
Let's use this advantage again, this time for the duration of the rides. Instead of distance, the hover data will show the average speed.
px.treemap(df,
path=[px.Constant('Rides'), 'country', 'bike', 'date'],
height=700,
color=df.seconds/60/60,
labels={'color': 'Hours', 'date': 'Date', 'duration': 'Duration',
'avgspeed': 'Avg Speed km/h'},
color_continuous_scale='rainbow',
hover_name='duration',
hover_data=['date', 'duration', 'avgspeed'],
title='Duration Treemap of Rides Categorized by Country and Bike')
From all the treemaps I can see that the Road Ti bike is the busiest and fastest, with an average ride duration of 3.866 hours (03:51:58), an average distance of 92.89 km and an average 'average speed' :D of 23.48 km/h.
This bike also holds the record for my longest ride ever: 311.3 km and a duration of 11:34:15. It wasn't that hard by the way.
Average Ride Duration by Bike¶
It seems to me quite difficult to grasp at a glance what the average time per bike is when it is shown as hours in float form.
Let's quickly calculate and visualize the time averages in an easy-to-understand way.
I get the averages with the mean() function and apply round(0) to drop the fractional seconds:
ard = df[['bike', 'seconds']]
ad = ard.groupby('bike')['seconds'].mean()
ad.round(0)
bike City 6007.0 E-MTB 6330.0 Hybrid Al 10801.0 MTB St 10461.0 Rented 8146.0 Road Al 11059.0 Road Ti 13918.0 Name: seconds, dtype: float64
To convert seconds to hh:mm:ss format I use datetime.timedelta() function:
sec = 13918
convert = str(datetime.timedelta(seconds = sec))
print(convert)
3:51:58
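The conversion above handles one value at a time; as a sketch, all the bike averages can be converted in one vectorized call (the values are the rounded means printed above):

```python
import pandas as pd

avg_sec = pd.Series({'City': 6007, 'E-MTB': 6330, 'Hybrid Al': 10801,
                     'MTB St': 10461, 'Rented': 8146, 'Road Al': 11059,
                     'Road Ti': 13918}, name='seconds')
# pandas renders timedeltas as '0 days hh:mm:ss'; slice off the 7-char day prefix
durations = pd.to_timedelta(avg_sec, unit='s').astype(str).str.slice(7)
print(durations['Road Ti'])   # 03:51:58
```

This only works as-is for durations under 24 hours, which is safe for single rides.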
Now let's manually create a little dataframe with these values.
I will arrange the bikes in chronological order of their appearance in the dataframe:
data = {'bike': ['Hybrid Al', 'Rented', 'Road Al', 'E-MTB', 'City', 'MTB St', 'Road Ti'],
'seconds': [10801, 8146, 11059, 6330, 6007, 10461, 13918],
'duration': ['03:00:01', '02:15:46', '03:04:19', '01:45:30', '01:40:07',
'02:54:21', '03:51:58']}
ad = pd.DataFrame(data)
fig = px.bar(ad,
x='bike',
y=ad.seconds/60/60,
color='bike',
height=500,
hover_data='duration',
color_discrete_sequence=['green', 'red', 'blue', 'purple',
'yellow', 'orange', 'black'],
labels={'y': 'Hours', 'bike': 'Bike', 'duration': 'Avg Duration'},
title='Average Ride Duration by Bike')
fig.update_layout(showlegend=False)
Now the hover data displays the average ride duration per bike in a user-friendly format.
Cycling Sunburst Charts ¶
Sunburst plots visualize hierarchical data spanning outwards radially from root to leaves.
In addition to cool animation and attractive appearance, they have another important difference from treemap charts.
If you compare sunburst hover data with hover data from treemap graphs, you will notice that it is different, and sometimes quite significantly. Why is that?
According to Plotly sunburst documentation: "If a color argument is passed, the color of a node is computed as the average of the color values of its children, weighted by their values".
So a sunburst plot does not show the plain average of a sector in its hover data, but the weighted average, calculated as: weighted average = Σ(wᵢ × xᵢ) / Σ wᵢ, where xᵢ are the children's color values and wᵢ their values.
The easiest way to understand the difference between an arithmetic average and a weighted average is the example of average speed.
If you rode 100 kilometers at an average speed of 20 km/h, and then 10 km at an average speed of 30 km/h, the arithmetic average of the two speeds is (20+30) / 2 = 25 km/h.
So you rode 110 km at an average speed of 25 km/h. Did you really?
In fact, the two segments took 5 hours and 20 minutes in total, while the arithmetic average flatters you with 04:24.
But if you calculate the weighted average, the result comes much closer to the truth.
The weight of each value here is the distance, so: (20×100 + 30×10) / (100+10) = (2000+300) / 110 = 20.9 km/h.
This average speed gives a total moving time of 05:15:39, which you will agree is much closer to 05:20.
Now you know how Strava calculates your total moving time combining all the segments between pauses together!
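The speed example above, written out as a few lines of Python (numbers straight from the text):

```python
speeds = [20, 30]    # km/h on the two segments
dists = [100, 10]    # km ridden at each speed, used as weights

arith = sum(speeds) / len(speeds)                                  # 25.0 km/h
weighted = sum(s * d for s, d in zip(speeds, dists)) / sum(dists)  # ~20.9 km/h

# Sanity check: total distance / total time is the true average speed
true_avg = sum(dists) / sum(d / s for s, d in zip(speeds, dists))
print(round(weighted, 1), round(true_avg, 3))
```

The distance-weighted figure (20.9 km/h) is close to, though not identical with, the true time-based average (20.625 km/h), which is why the weighted result is "similar to the truth" rather than exact.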
px.sunburst(data_frame=df,
path=['country', 'location', 'bike'],
values=df.seconds/60/60,
height=600,
color='distance',
color_continuous_scale='RdBu_r',
labels={'distance': 'Wt Avg km', 'values': 'Hours'},
title='Bikes Grouped Geographically and Colored by Weighted Average Distance')
Here is the sunburst plot for distance.
Are you surprised by the difference between the 55.97 km shown for MTB St here and the 45.50 km in the treemap chart? All rides on this bike took place in a single location and country, Ayia Napa, Cyprus, so the two values are directly comparable – the gap comes purely from the weighting.
px.sunburst(data_frame=df,
path=['country', 'location', 'bike'],
values='distance',
height=600,
color='avgspeed',
color_continuous_scale='hsv',
labels={'avgspeed': 'Wt Avg km/h', 'distance': 'Distance km'},
title='Bikes Grouped Geographically and Colored by Weighted Average Speed')
It's the same story with the weighted average speed sunburst and the average speed treemap.
px.sunburst(data_frame=df,
path=['country', 'location', 'bike'],
values='distance',
height=600,
color=df.seconds/60/60,
color_continuous_scale='viridis',
labels={'color': 'Wt Avg Hours', 'distance': 'Distance km'},
title='Bikes Grouped Geographically and Colored by Weighted Average Ride Duration in Hours')
Sunburst graph shows total distance and weighted average ride duration in hours by country, location and bike.
Checking the Weighted Average Calculation¶
Let's calculate the weighted average value ourselves to see the result using a real example.
I will show the difference using the 'Italy → Villasimius → E-MTB' sector. There are only two rides in it, so it is easy to count.
In my code the 'color' argument is df.seconds/60/60 – seconds converted into hours. In the dataset these two rides have 6056 and 6605 seconds, which convert to 6056/60/60 = 1.68222 and 6605/60/60 = 1.83472 hours. The simple average of the two is (1.68222+1.83472) / 2 = 1.75847 hours, or 6330.5 seconds.
But the sunburst shows 1.76949 hours, or 6370.16 seconds, and here is why: the phrase "weighted by their values" means that the weights of our values must also be taken into account.
According to my code, values come from the 'distance' column: 17.02 km corresponds to 6056 seconds and 22.77 to 6605.
So calculation of weighted average will be: (17.02×6056 + 22.77×6605) / (17.02+22.77) = (103073.12 + 150395.85) / 39.79 = 6370.16763 seconds or 1.76949 hours, which is exactly the value that the graph displays.
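The same check can be done in one call with NumPy's np.average, which accepts the weights directly (values from the two Italian rides above):

```python
import numpy as np

hours = np.array([6056, 6605]) / 3600   # ride durations in hours
weights = np.array([17.02, 22.77])      # distances in km act as weights

plain = hours.mean()                          # arithmetic mean: ~1.75847
weighted = np.average(hours, weights=weights)
print(round(weighted, 5))   # 1.76949, matching the sunburst hover data
```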
I found out empirically that in my case the weighted average value exceeds the arithmetic mean for all cycling variables.
Bike Comparison by Hours Spent and Distance Covered ¶
Comparing apples to apples :D I mean bike to bike by hours spent and distance covered.
px.pie(df,
names='bike',
values=df.seconds/60/60,
color_discrete_sequence=['blue', 'green', 'black', 'orange',
'red', 'purple', 'yellow'],
height=500,
hole=.7,
labels={'bike': 'Bike', 'values': 'Hours'},
title='Hours Spent on Bike')
Winston Churchill knew what he was talking about when he said: "No hour of life is wasted that is spent in the saddle."
He said this about a horse, but the quote applies just as much to a bicycle.
Especially if the bikefit procedure has been carried out and the saddle fits anatomically.
px.pie(df,
names='bike',
values=df.distance,
color_discrete_sequence=['blue', 'green', 'black', 'orange',
'red', 'yellow', 'purple'],
height=500,
hole=.7,
labels={'bike': 'Bike', 'distance': 'Distance km'},
title='Distance Covered on Bike')
If you compare both of these pie charts, it will suddenly become clear which bike is faster and which is slower.
If the percentage of distance exceeds the percentage of time, then the bike is fast and vice versa.
To make the comparison easier, let's introduce a speed factor, where 1 means equal percentages, and a value > 1 means a larger share of kilometers is covered per unit of time (the bike is faster).
So for my bikes it will be:
data = {'bike': ['Hybrid Al', 'Road Al', 'Road Ti', 'MTB St', 'Rented',
'E-MTB', 'City'],
'hours_%': [35.4, 38.2, 18.7, 6.2, 0.966, 0.25, 0.237],
'distance_%': [31.8, 40.3, 22.4, 4.83, 0.455, 0.162, 0.141]}
eff = pd.DataFrame(data)
eff['speed_factor'] = (eff['distance_%'] / eff['hours_%'])
eff
| bike | hours_% | distance_% | speed_factor | |
|---|---|---|---|---|
| 0 | Hybrid Al | 35.400 | 31.800 | 0.898305 |
| 1 | Road Al | 38.200 | 40.300 | 1.054974 |
| 2 | Road Ti | 18.700 | 22.400 | 1.197861 |
| 3 | MTB St | 6.200 | 4.830 | 0.779032 |
| 4 | Rented | 0.966 | 0.455 | 0.471014 |
| 5 | E-MTB | 0.250 | 0.162 | 0.648000 |
| 6 | City | 0.237 | 0.141 | 0.594937 |
As expected, Road Ti turned out to be the fastest, and Rented bikes are the slowest.
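The percentage inputs above were typed in by hand; as a sketch, the same factor could be computed straight from the dataset (`speed_factor` is my own helper name, assuming the 'bike', 'seconds' and 'distance' columns):

```python
import pandas as pd

def speed_factor(df: pd.DataFrame) -> pd.Series:
    """Share of total distance divided by share of total time, per bike."""
    g = df.groupby('bike')[['seconds', 'distance']].sum()
    shares = g / g.sum()   # each column becomes fractions of its own total
    return (shares['distance'] / shares['seconds']).sort_values(ascending=False)
```

Algebraically this factor is just each bike's overall average speed divided by the fleet-wide average speed, which is why 1 marks the break-even point.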
Bike Rides by Time Categories ¶
Let's see how my rides are distributed by year, season, month, week of the year and day of the week:
fig = px.bar(df,
x='year',
y='distance',
color='bike',
height=500,
color_discrete_sequence=['green', 'red', 'blue', 'purple', 'yellow',
'orange', 'black'],
hover_data=['date', 'duration', 'location'],
labels={'distance': 'Distance km', 'year': 'Year', 'bike': 'Bike',
'date': 'Date', 'duration': 'Duration', 'location': 'Location'},
title='Bike Rides by Year')
fig.update_xaxes(type='category')
I started riding really passionately in 2019, when I exceeded the 3000 km mark in a year for the first time.
In 2022, thanks to wintering in Cyprus, I reached a record of 6034.51 km.
The year 2023 was notable for achieving a record of 5705.50 km in Estonia alone.
I take these exact numbers from my previously saved Excel file (one more reason to save data as xlsx and not csv), but you can see that the graph perfectly confirms them. Finally I can visualize it!
During all this time I have owned four bicycles (Hybrid Al, Road Al, MTB St and Road Ti) and the timeline of ownership can be clearly traced over the years.
Hybrid Al has been with me through all these years, and in 2022 I owned four bicycles at the same time :D
That same year I swapped my main road bike from aluminium to titanium, and now I have only two bikes: Hybrid Al and Road Ti.
px.bar(df,
x='season',
y='distance',
color='bike',
height=500,
category_orders={'season': ['Winter', 'Spring', 'Summer', 'Autumn']},
color_discrete_sequence=['green', 'red', 'blue', 'purple', 'yellow',
'orange', 'black'],
hover_data=['date', 'duration', 'location'],
labels={'distance': 'Distance km', 'season': 'Season', 'bike': 'Bike',
'date': 'Date', 'duration': 'Duration', 'location': 'Location'},
title='Bike Rides by Season')
It is clear that winter is the least popular season (1033.79 km), and most of those kilometers come from Cyprus.
Summer is a great time for cycling, as evidenced by the 12743.06 kilometers covered.
In autumn, when I’m at the peak of my physical fitness after summer, I continue to ride almost until frost (8613.48 km).
In spring, weakened after winter, overcoming the cold and laziness, I am just beginning to roll out, so the result is rather weak (5854.99 km).
px.bar(df,
x='month',
y='distance',
color='bike',
height=500,
category_orders={'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']},
color_discrete_sequence=['green', 'red', 'blue', 'purple', 'yellow',
'orange', 'black'],
hover_data=['date', 'duration', 'location'],
labels={'distance': 'Distance km', 'month': 'Month', 'bike': 'Bike',
'date': 'Date', 'duration': 'Duration', 'location': 'Location'},
title='Bike Rides by Month')
The most popular month for me is June with 4630.87 km, followed by almost equal August (4253.59 km) and September (4223.06 km).
The worst conditions for riding are in January (272.65 km) and December (355.49 km).
While the road bikes (Road Al and Road Ti) have their cycling season from May to early October, Hybrid Al is used mainly in spring and autumn.
px.bar(df,
x='week',
y='distance',
color='bike',
height=500,
color_discrete_sequence=['green', 'red', 'blue', 'purple', 'yellow',
'orange', 'black'],
hover_data=['date', 'duration', 'location'],
labels={'distance': 'Distance km', 'week': 'Week Number', 'bike': 'Bike',
'date': 'Date', 'duration': 'Duration', 'location': 'Location'},
title='Bike Rides by Week of the Year')
I'm proud to say that there are only two weeks of the year, numbers 3 and 50, in which I have never ridden a bike!
While the Estonian cycling season at most lasts from week 11 to week 49 (Mar 19 till Dec 6), I was able to ride during the winter of 2021/2022 in Cyprus and once on a trip to Abu Dhabi.
Most popular week for cycling is 39 (end of September) with total of 1497.31 km covered.
I know myself, and the graph shows that the most popular day of the year is August 25th with 681.51 km covered.
px.bar(df,
x='dayoftheweek',
y='distance',
color='bike',
height=500,
hover_data=['date', 'duration', 'location'],
color_discrete_sequence=['green', 'red', 'blue', 'purple', 'yellow',
'orange', 'black'],
category_orders ={'dayoftheweek':['MON', 'TUE', 'WED', 'THU', 'FRI',
'SAT', 'SUN']},
labels={'distance': 'Distance km', 'dayoftheweek': 'Day of the Week',
'bike': 'Bike', 'date': 'Date', 'duration': 'Duration',
'location': 'Location'},
title='Bike Rides by Day of the Week')
This graph clearly shows that the end of the week from Friday to Sunday is the busiest.
Peak of my activity is on Sunday with 5569.01 km and worst result on Thursday – 2620.86 km.
This is of course due to the work week. Interestingly, the pattern repeats for most of the bicycles, which means the trend is very strong.
Temperature ¶
With the purely cycling part analyzed, let's move on to the weather data!
I have always liked the OHLC candlestick chart in Plotly. It is usually used for stock visualizations, but it can also be used to visualize air temperature, because we have tempmax and tempmin.
Since I don't have hourly data, I will assume that the value in the 'temp' column serves as both open and close:
fig = go.Figure(data=[go.Candlestick(x=df['date'],
open=df['temp'],
high=df['tempmax'],
low=df['tempmin'],
close=df['temp'],
increasing_line_color= 'red',
decreasing_line_color= 'blue')])
fig.update_layout(
title='Temperature Range on Cycling Days',
yaxis_title='Degrees °C')
fig.add_hrect(
y0="7", y1="27",
fillcolor="green", opacity=0.4,
layer="below", line_width=0)
fig.show()
This candlestick chart shows the temperature range on any given day of my cycling.
The green zone is my normal range, between +7°C and +27°C.
If the temperature rises relative to the previous ride, then the color of the line is red, if it falls, it is blue.
As you can see, there are far more days colder than my normal range than hotter ones. Global warming has not yet reached Estonia :D
Feel free to use the range slider and observe any time period from 2015 to 2023.
Temperature Perception¶
But let's be honest, I prefer to ride during those hours of the day when the temperature is above average and closer to maximum.
Let's call this variable 'ridetemp' and find out at what temperature levels I rode.
First I make a subset with the columns I need (as a copy, to avoid pandas' SettingWithCopyWarning) and calculate the ride temperature as a value between 'temp' and 'tempmax':
td = df[['date', 'tempmax', 'temp']].copy()
td['ridetemp'] = (td.tempmax - td.temp) / 2 + td.temp
td.head()
| date | tempmax | temp | ridetemp | |
|---|---|---|---|---|
| 0 | 2015-05-01 | 16.0 | 10.6 | 13.30 |
| 1 | 2015-05-20 | 16.0 | 12.3 | 14.15 |
| 2 | 2015-05-30 | 19.0 | 11.8 | 15.40 |
| 3 | 2015-06-05 | 17.0 | 12.9 | 14.95 |
| 4 | 2015-06-22 | 16.0 | 14.1 | 15.05 |
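A small aside on the formula: (tempmax − temp) / 2 + temp is algebraically just the midpoint (tempmax + temp) / 2, so a shorter equivalent exists. A quick check on the first rows of the table above:

```python
import pandas as pd

td = pd.DataFrame({'tempmax': [16.0, 19.0], 'temp': [10.6, 11.8]})
a = (td.tempmax - td.temp) / 2 + td.temp   # form used in the notebook
b = (td.tempmax + td.temp) / 2             # equivalent midpoint form
# The two agree up to floating-point rounding
print((a - b).abs().max() < 1e-12)
```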
Then I assign temperature levels depending on the level of perception and add them as new column:
def assign_templevel(row):
if row['ridetemp'] <= 7.0:
return 'Cold Weather: below +7°C'
elif row['ridetemp'] > 7.0 and row['ridetemp'] <= 17.0:
return 'Moderate Weather: +7..+17°C'
elif row['ridetemp'] > 17.0 and row['ridetemp'] <= 22.0:
return 'Comfort Weather: +17..+22°C'
elif row['ridetemp'] > 22.0 and row['ridetemp'] <= 27.0:
return 'Warm Weather: +22..+27°C'
else:
return 'Hot Weather: above +27°C'
td.loc[:, 'temppercep'] = td.apply(assign_templevel, axis=1)
td.head()
| date | tempmax | temp | ridetemp | temppercep | |
|---|---|---|---|---|---|
| 0 | 2015-05-01 | 16.0 | 10.6 | 13.30 | Moderate Weather: +7..+17°C |
| 1 | 2015-05-20 | 16.0 | 12.3 | 14.15 | Moderate Weather: +7..+17°C |
| 2 | 2015-05-30 | 19.0 | 11.8 | 15.40 | Moderate Weather: +7..+17°C |
| 3 | 2015-06-05 | 17.0 | 12.9 | 14.95 | Moderate Weather: +7..+17°C |
| 4 | 2015-06-22 | 16.0 | 14.1 | 15.05 | Moderate Weather: +7..+17°C |
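The row-wise apply works fine; as an alternative sketch, pd.cut expresses the same binning declaratively (its default right-closed intervals match the <= comparisons above; the small frame here stands in for the real td):

```python
import pandas as pd

# Stand-in for the td frame from the notebook, one value per bin
td = pd.DataFrame({'ridetemp': [5.0, 13.3, 20.0, 25.0, 30.0]})

bins = [-float('inf'), 7, 17, 22, 27, float('inf')]
labels = ['Cold Weather: below +7°C', 'Moderate Weather: +7..+17°C',
          'Comfort Weather: +17..+22°C', 'Warm Weather: +22..+27°C',
          'Hot Weather: above +27°C']
td['temppercep'] = pd.cut(td['ridetemp'], bins=bins, labels=labels)
print(td['temppercep'].iloc[1])   # Moderate Weather: +7..+17°C
```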
# Aggregate rides per perception level first – more robust than passing
# td.value_counts().values, whose ordering is not guaranteed to match
tp = td.groupby('temppercep', as_index=False).size()
px.pie(tp,
       names='temppercep',
       values='size',
       color_discrete_sequence=['cyan', 'green', 'blue', 'orange', 'red'],
       height=500,
       hole=0.7,
       labels={'temppercep': 'Perception', 'size': 'Rides'},
       title='Temperature Perception on Cycling Days')
Nearly half of the rides take place in moderate temperatures, such is the weather in our latitudes.
Comfort weather accounts for 34.5%, which is also quite a lot.
Cold and hot weather together make up just under 10%, most of it of course on the cold side.
The main problem with this is that I don't have the right clothes and generally the ability to withstand such temperatures on a bike. When it's cold, my toes freeze the most, and when it's hot, I can even get heatstroke.
UV Index and Daylight Hours ¶
The UV Index as a scale of risk of harm was developed by the World Health Organization, the United Nations Environment Program and the World Meteorological Organization. The UV index allows you to assess the danger of ultraviolet radiation in the sunlight spectrum for human skin.
Since ideal cycling weather is dry and sunny, I expect high UV index scores to be prevalent in my data.
Further I will continue to use the "subset → function → graph" scheme to display weather variables.
I won't describe this in detail anymore, let my code speak for itself:
uv = df[['date', 'location', 'uvindex', 'month', 'daylength']].copy()
def assign_ulv(row):
if row['uvindex'] >= 8:
return 'Very High'
elif row['uvindex'] >= 6 and row['uvindex'] <= 7:
return 'High'
elif row['uvindex'] >= 3 and row['uvindex'] <= 5:
return 'Moderate'
else:
return 'Low'
uv.loc[:, 'uvexposure'] = uv.apply(assign_ulv, axis=1)
uv.head()
| date | location | uvindex | month | daylength | uvexposure | |
|---|---|---|---|---|---|---|
| 0 | 2015-05-01 | Tallinn | 3 | May | 15:51:57 | Moderate |
| 1 | 2015-05-20 | Tallinn | 6 | May | 17:21:27 | High |
| 2 | 2015-05-30 | Tallinn | 7 | May | 17:59:32 | High |
| 3 | 2015-06-05 | Tallinn | 8 | Jun | 18:17:23 | Very High |
| 4 | 2015-06-22 | Tallinn | 7 | Jun | 18:39:19 | High |
# Same aggregation approach as with the temperature perception pie
ue = uv.groupby('uvexposure', as_index=False).size()
px.pie(ue,
       names='uvexposure',
       values='size',
       color_discrete_sequence=['orange', 'red', 'yellow', 'green'],
       height=500,
       hole=.7,
       labels={'uvexposure': 'UV Exposure', 'size': 'Rides'},
       title='Ultraviolet Exposure on Cycling Days')
I took the scale gradations from the Wikipedia article and in the graphs I will also adhere to the color designations from there.
About 2/3 of the rides took place under high or very high UV exposure.
To my shame, I almost never use sunscreen and there were days when I got sunburned. The worst sun I encountered was in Japan, on the island of Kumejima, I literally got a sunburn in 15 minutes. Even though I had sunscreen on :D
Average UV Indices by Month¶
It would be interesting to know the average UV indices by month and how they depend on daylight hours and whether this affects the average duration of rides.
I calculated the length of daylight in seconds earlier by subtracting 'sunrise' from 'sunset' in dataframe.
First I group data by month to get averages:
dl = df.groupby(df['month'])[['seconds', 'daysec', 'tempmax', 'uvindex']].mean().reset_index()
dl
| month | seconds | daysec | tempmax | uvindex | |
|---|---|---|---|---|---|
| 0 | Apr | 10320.225000 | 52185.475000 | 12.990000 | 6.300000 |
| 1 | Aug | 11277.344262 | 55588.803279 | 21.196721 | 6.377049 |
| 2 | Dec | 8801.444444 | 31284.222222 | 11.822222 | 2.777778 |
| 3 | Feb | 11986.125000 | 39493.875000 | 17.900000 | 6.000000 |
| 4 | Jan | 9300.000000 | 36244.285714 | 15.871429 | 4.714286 |
| 5 | Jul | 11387.122807 | 63633.543860 | 22.612281 | 7.140351 |
| 6 | Jun | 11893.333333 | 66609.363636 | 21.930303 | 7.575758 |
| 7 | Mar | 11141.571429 | 43925.785714 | 13.150000 | 5.714286 |
| 8 | May | 9957.016949 | 61103.220339 | 17.796610 | 7.203390 |
| 9 | Nov | 13834.434783 | 28948.086957 | 8.304348 | 1.043478 |
| 10 | Oct | 12115.142857 | 37447.142857 | 11.447619 | 2.642857 |
| 11 | Sep | 11376.222222 | 45860.523810 | 16.980952 | 4.809524 |
Then I convert the received seconds into hh:mm:ss format one by one to get user-friendly values for hover_data and add them to dl dataframe:
sec = 45861
convert = str(datetime.timedelta(seconds = sec))
print(convert)
12:44:21
dl.insert(1, 'duration', ['02:52:00', '03:07:57', '02:26:41', '03:19:46', '02:35:00',
'03:09:47', '03:18:13', '03:05:42', '02:45:57', '03:50:34',
'03:21:55', '03:09:36'], True)
dl.insert(2, 'daylength', ['14:29:45', '15:26:29', '08:41:24', '10:58:14', '10:04:04',
'17:40:34', '18:30:09', '12:12:06', '16:58:23', '08:02:28',
'10:24:07', '12:44:21'], True)
dl
| month | duration | daylength | seconds | daysec | tempmax | uvindex | |
|---|---|---|---|---|---|---|---|
| 0 | Apr | 02:52:00 | 14:29:45 | 10320.225000 | 52185.475000 | 12.990000 | 6.300000 |
| 1 | Aug | 03:07:57 | 15:26:29 | 11277.344262 | 55588.803279 | 21.196721 | 6.377049 |
| 2 | Dec | 02:26:41 | 08:41:24 | 8801.444444 | 31284.222222 | 11.822222 | 2.777778 |
| 3 | Feb | 03:19:46 | 10:58:14 | 11986.125000 | 39493.875000 | 17.900000 | 6.000000 |
| 4 | Jan | 02:35:00 | 10:04:04 | 9300.000000 | 36244.285714 | 15.871429 | 4.714286 |
| 5 | Jul | 03:09:47 | 17:40:34 | 11387.122807 | 63633.543860 | 22.612281 | 7.140351 |
| 6 | Jun | 03:18:13 | 18:30:09 | 11893.333333 | 66609.363636 | 21.930303 | 7.575758 |
| 7 | Mar | 03:05:42 | 12:12:06 | 11141.571429 | 43925.785714 | 13.150000 | 5.714286 |
| 8 | May | 02:45:57 | 16:58:23 | 9957.016949 | 61103.220339 | 17.796610 | 7.203390 |
| 9 | Nov | 03:50:34 | 08:02:28 | 13834.434783 | 28948.086957 | 8.304348 | 1.043478 |
| 10 | Oct | 03:21:55 | 10:24:07 | 12115.142857 | 37447.142857 | 11.447619 | 2.642857 |
| 11 | Sep | 03:09:36 | 12:44:21 | 11376.222222 | 45860.523810 | 16.980952 | 4.809524 |
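As an aside, the two hand-typed string lists could be generated from dl itself; a sketch with a helper (`to_hms` is my own name, and it assumes durations under 24 hours):

```python
import pandas as pd

def to_hms(sec: pd.Series) -> pd.Series:
    """Format a Series of (sub-day) seconds as 'hh:mm:ss' strings."""
    # pandas renders timedeltas as '0 days hh:mm:ss'; drop the 7-char day prefix
    return pd.to_timedelta(sec.round(), unit='s').astype(str).str.slice(7)

# With the real dl frame this would be:
# dl['duration'] = to_hms(dl['seconds'])
# dl['daylength'] = to_hms(dl['daysec'])
print(to_hms(pd.Series([45860.52, 13918])).tolist())   # ['12:44:21', '03:51:58']
```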
fig = px.bar(dl,
x='month',
y=dl.daysec/60/60,
color='uvindex',
height=500,
category_orders={'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']},
hover_name='daylength',
hover_data=[dl.duration, dl.tempmax.round(2), dl.uvindex.round(2)],
color_continuous_scale='RdYlGn_r',
labels={'y': 'Avg Day Length in Hours', 'month': 'Month',
'duration': 'Avg Duration', 'hover_data_1': 'Avg Max Temperature °C',
'hover_data_2': 'Avg UV Index', 'uvindex': 'UV Index'},
title='Monthly Averages of Day Length, Ride Duration, Maximum Temperature and UV Index')
fig.add_trace(go.Scatter(
x=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    y=[9300/60/60, 11986/60/60, 11142/60/60, 10320/60/60, 9957/60/60, 11893/60/60,
       11387/60/60, 11277/60/60, 11376/60/60, 12115/60/60, 13834/60/60, 8801/60/60],
name='Avg Duration', line=dict(color='royalblue', width=6.5),
showlegend=False))
I converted the seconds on the 'y' scale to hours to make the values easier to read. The blue line represents the average ride duration.
Note how cleverly I rounded 'tempmax' and 'uvindex' in the hover data and then assigned labels to them.
The usual rounding through a dictionary is not applicable here, since 'duration' is an object column and should remain unchanged. I found this elegant solution through trial and error.
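As an aside, rounding only the numeric columns of a dataframe can also be done with a dict built from `select_dtypes`; object columns such as 'duration' are simply not included. A sketch on toy data (the column names mirror the notebook's, but the frame itself is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'duration': ['02:52:00', '03:07:57'],   # object column, must stay as-is
    'tempmax':  [12.990000, 21.196721],
    'uvindex':  [6.300000, 6.377049],
})

# Build a {column: decimals} dict covering the numeric columns only.
decimals = {col: 2 for col in df.select_dtypes('number').columns}
rounded = df.round(decimals)

print(rounded)
```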
The UV index, as expected, directly correlates with day length.
But daylight hours depend on geographical location, and since all my locations are represented on this graph, the predominantly Cypriot December has a higher value than the purely Estonian November.
This may also explain why the average UV index is already quite high starting in January.
It is interesting to note that the average maximum temperature on cycling days does not vary much between months.
Reaching an average maximum of +22.6°C in July, it fluctuates in an acceptable range above +11°C, and only in November drops to +8.3°C.
The average ride duration is from 2.5 to 4 hours, with the maximum falling in the cold and dark November, when I usually finish the cycling season in Estonia.
This can be explained by the desire to make the most of the end of the season, when I used every ride to increase my maximum mileage for the year.
Relationship Between Ride Duration and Daylight Hours¶
g = sns.jointplot(data=df, x=df.daysec/60/60, y=df.seconds/60/60, color='yellow')
g.plot_joint(sns.kdeplot, cmap='hot_r', n_levels=60, fill=True)
g.fig.suptitle('Frequency and Duration of Rides Depending on Daylight Hours')
g.ax_joint.set_xlabel('Daylight Hours')
g.ax_joint.set_ylabel('Cycling Hours')
g.figure.tight_layout()
plt.show()
My hometown Tallinn is located at the 59th parallel north. This means long days in summer and short ones in winter (the extreme values in my dataframe are 18:39:27 (Jun 21) and 06:21:48 (Dec 6)). Helsinki is even further north, and the longest daylight of my cycling days happened there: 18:44:21 (Jul 2).
The ride duration variable has some outliers (the shortest ride is 00:03:40 and the longest 11:34:15), but the bulk lies inside the highlighted bivariate kernel density estimation (KDE) area.
This jointplot visualizes how ride duration depends on daylight hours.
It is clearly seen that most often the duration of my rides is from 2 to 4 hours, and I especially like to cycle on those days when there is over 15 hours of daylight.
This corresponds to the months from May to August.
Other graphs show that these months plus September are the most popular by number of rides (306 in total, or 68% of entries).
UV Indices of Bike Rides by Month and Location¶
Now let's see how the UV indices of bike rides are distributed by month and by location:
px.bar(uv,
x='month',
y=uv.value_counts().values,
color='uvexposure',
height=500,
category_orders={'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']},
color_discrete_sequence=['yellow', 'orange', 'red', 'green'],
hover_data=['date', 'daylength', 'location', 'uvindex'],
labels={'y': 'Count', 'month': 'Month', 'uvexposure': 'UV Exposure', 'date': 'Date',
'daylength': 'Day Length', 'location': 'Location', 'uvindex': 'UV Index'},
title='Bike Rides by Month and UV Index')
The overall graph shows how ultraviolet radiation remains consistently high from February to August, and becomes very high in May–July.
September and October are transition months. The first is from high to moderate UV exposure, the second is from moderate to low.
All November rides took place at a low UV index.
Tln = ['Tallinn']
uvt = uv[uv['location'].isin(Tln)]
px.bar(uvt,
x='month',
y=uvt.value_counts().values,
color='uvexposure',
height=500,
category_orders={'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']},
color_discrete_sequence=['yellow', 'orange', 'red', 'green'],
hover_data=['date', 'daylength', 'location', 'uvindex'],
labels={'y': 'Count', 'month': 'Month', 'uvexposure': 'UV Exposure', 'date': 'Date',
'daylength': 'Day Length', 'location': 'Location', 'uvindex': 'UV Index'},
title='Tallinn Bike Rides by Month and UV Index')
The UV index varies greatly depending on geographic location.
In my hometown Tallinn, the UV index is clearly driven by daylight hours.
The graph consistently shows highs around the June peak, when day length is at its longest, and lows at the end of the year.
Also, I have never ridden a bike in Tallinn in January or February.
And for contrast, here is what the UV index looks like in all other locations outside of Tallinn:
uvnt = uv[~uv['location'].isin(Tln)]
px.bar(uvnt,
x='month',
y=uvnt.value_counts().values,
color='uvexposure',
height=500,
category_orders={'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']},
color_discrete_sequence=['orange', 'red', 'green', 'yellow'],
hover_data=['date', 'daylength', 'location', 'uvindex'],
labels={'y': 'Count', 'month': 'Month', 'uvexposure': 'UV Exposure', 'date': 'Date',
'daylength': 'Day Length', 'location': 'Location', 'uvindex': 'UV Index'},
title='Outside Tallinn Bike Rides by Month and UV Index')
Please note that, along with Tallinn, both places with a low UV index are in Estonia.
Humidity, Cloudiness and Precipitation ¶
h = sns.jointplot(data=df, x='humidity', y='cloudcover', kind='hist', fill=True)
h.fig.suptitle('Relationship Between Humidity and Cloudiness')
h.ax_joint.set_xlabel('Humidity %')
h.ax_joint.set_ylabel('Cloudiness %')
h.figure.tight_layout()
plt.show()
It was seen from the correlation matrix that there is a strong positive correlation between 'humidity' and 'cloudcover' (0.64).
The joint plot reveals this connection in more detail.
Note that cloudiness varies easily from 0% to 100%, while humidity ranges from 37.1% to 98.5%.
Precipitation¶
It is clear that I try to avoid the rain, but how successful am I?
Let's find out by using a subset and defining a function:
precip = df[['date', 'precip']].copy()  # copy, so adding a column later does not trigger SettingWithCopyWarning
def assign_intensity(row):
if row['precip'] == 0.000:
return 'No Rain'
elif row['precip'] > 0.000 and row['precip'] < 1.000:
return 'Very Light Rain'
elif row['precip'] >= 1.000 and row['precip'] <= 10.000:
return 'Light Rain'
else:
return 'Moderate Rain'
precip.loc[:, "intensity"] = precip.apply(assign_intensity, axis=1)
precip.head()
| | date | precip | intensity |
|---|---|---|---|
| 0 | 2015-05-01 | 0.000 | No Rain |
| 1 | 2015-05-20 | 0.201 | Very Light Rain |
| 2 | 2015-05-30 | 0.399 | Very Light Rain |
| 3 | 2015-06-05 | 0.000 | No Rain |
| 4 | 2015-06-22 | 4.171 | Light Rain |
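The same binning can also be done without a row-wise `apply`, using `pd.cut` (a sketch on toy values; the bin edges follow the function above, though exact-boundary handling at 1 mm and 10 mm differs marginally since `pd.cut` intervals are right-closed):

```python
import pandas as pd

precip_mm = pd.Series([0.000, 0.201, 4.171, 12.5])

intensity = pd.cut(
    precip_mm,
    bins=[-0.001, 0.0, 1.0, 10.0, float('inf')],   # right-closed intervals
    labels=['No Rain', 'Very Light Rain', 'Light Rain', 'Moderate Rain'],
)
print(intensity.tolist())
```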
px.pie(precip,
names='intensity',
values=precip.value_counts().values,
color_discrete_sequence=['green', 'lightblue', 'blue', 'navy'],
height=500,
hole=.7,
labels={'intensity': 'Intensity', 'values': 'Rides'},
title='Precipitation on Cycling Days')
When determining precipitation intervals, I was guided by Rainfall Classification: Intensity of Rainfall in 24 Hours.
As you can see, I quite successfully managed to avoid heavy rain and was only occasionally caught in moderate rain.
This graph does not mean that I rode in the rain on all the days when it rained. Most often, on a day when I rode and it rained, the rain fell at a different time of day. To this day I remember only about five times when I deliberately left the house in the rain. More often, rain caught me unexpectedly far from home.
Monthly Averages¶
It’s also interesting to see how humidity, cloudiness and precipitation vary by month. As I remember from the correlation matrix, there is a negative correlation between day length and both humidity and cloudiness.
Let's visualize the monthly average values of these parameters.
As before, I group data by month to get average values:
mhc = df.groupby(df['month'])[['daysec', 'precip', 'cloudcover', 'humidity']].mean().reset_index()
mhc
| | month | daysec | precip | cloudcover | humidity |
|---|---|---|---|---|---|
| 0 | Apr | 52185.475000 | 0.304675 | 25.837500 | 61.122500 |
| 1 | Aug | 55588.803279 | 1.363131 | 37.683607 | 75.916393 |
| 2 | Dec | 31284.222222 | 0.539111 | 57.522222 | 74.788889 |
| 3 | Feb | 39493.875000 | 0.018625 | 41.025000 | 67.800000 |
| 4 | Jan | 36244.285714 | 0.408429 | 42.385714 | 69.828571 |
| 5 | Jul | 63633.543860 | 0.186053 | 35.501754 | 70.029825 |
| 6 | Jun | 66609.363636 | 0.583091 | 26.340909 | 64.227273 |
| 7 | Mar | 43925.785714 | 0.154071 | 48.128571 | 64.021429 |
| 8 | May | 61103.220339 | 0.264390 | 27.638983 | 58.872881 |
| 9 | Nov | 28948.086957 | 1.166348 | 77.504348 | 87.778261 |
| 10 | Oct | 37447.142857 | 0.483452 | 63.904762 | 84.021429 |
| 11 | Sep | 45860.523810 | 0.260587 | 45.430159 | 80.534921 |
And convert the seconds into a readable format one by one:
sec = 45861
convert = str(datetime.timedelta(seconds = sec))
print(convert)
12:44:21
mhc.insert(2, 'avgdaylength', ['14:29:45', '15:26:29', '08:41:24',
'10:58:14', '10:04:04', '17:40:34',
'18:30:09', '12:12:06', '16:58:23',
'08:02:28', '10:24:07', '12:44:21'], True)
px.bar(mhc,
x='month',
y=mhc.daysec/60/60,
color='humidity',
height=500,
category_orders={'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']},
hover_name='avgdaylength',
hover_data={'humidity':':.2f', 'cloudcover':':.2f', 'precip':':.2f'},
color_continuous_scale='deep',
labels={'y': 'Avg Day Length in Hours', 'month': 'Month',
'humidity': 'Avg Humidity %', 'cloudcover': 'Avg Cloudiness %',
'precip': 'Avg Precipitation mm', 'avgdaylength': 'Avg Day Length'},
title='Monthly Averages of Day Length, Humidity, Cloudiness and Precipitation')
Seconds on the 'y' scale are converted to hours for better understanding.
The most humid month is November and the driest month is May.
The connection with daylight hours is also quite obvious, except that the peak of the dry season is not in June, but in May.
Cloudiness indicators most often follow humidity; there is a positive correlation.
Precipitation figures fluctuate between months, but this can easily be explained by the fact that I deliberately avoided rainy days.
Wind ¶
And here is my favorite part – wind diagrams!
First of all, it's beautiful.
Secondly, an analyst rarely gets to deal with a variable like wind.
Thirdly, the graphs that can be built for the wind variable are unique, yet very diverse.
So forgive me, but I won't rest until I build them all!
Polar Plot¶
To visualize wind related data, I create a data subset with columns 'location', 'winddir' and 'windspeed':
wd = df[['location', 'winddir', 'windspeed']].copy()  # copy, since a 'direction' column is added later
wd.head()
| | location | winddir | windspeed |
|---|---|---|---|
| 0 | Tallinn | 124.7 | 23.9 |
| 1 | Tallinn | 255.3 | 22.0 |
| 2 | Tallinn | 125.2 | 23.9 |
| 3 | Tallinn | 261.2 | 34.8 |
| 4 | Tallinn | 294.5 | 14.5 |
from math import radians  # degree-to-radian conversion for the polar axes

ax = plt.subplot(111, polar=True)
ax.scatter(x=[radians(x) for x in wd['winddir'].values],
           y=wd['windspeed'].values,
           color='blue', edgecolor='navy', alpha=0.5)
plt.title('Polar Plot of Rides by Wind Direction and Speed')
ax.set_theta_zero_location('N')
ax.set_theta_direction(-1)
An excellent graph showing the wind direction and speed for each ride. Thanks to the transparency argument alpha=0.5, areas with a higher concentration of rides appear darker.
Please note that the wind direction is indicated in degrees: 0° is north and 225°, for example, is southwest. The scale on the polar plot measures the wind speed.
Most often I rode on days when the wind speed was between 10 and 20 km/h.
Obviously, riding on days when the wind speed exceeds my own average speed of 20.09 km/h is harder.
Of course, I don’t ride against the wind the whole time, but still.
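The per-value `radians` conversion in the scatter call above can also be done in one vectorized NumPy call (a sketch):

```python
import numpy as np

winddir_deg = np.array([124.7, 255.3, 294.5])   # sample wind directions in degrees
winddir_rad = np.radians(winddir_deg)           # elementwise degrees -> radians
print(winddir_rad)
```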
Wind Roses¶
To work with wind rose graphs, I need to install the Windrose library:
pip install windrose
Requirement already satisfied: windrose in c:\python\lib\site-packages (1.9.0)
Now I assign wind directions according to the 16-point compass rose to create different types of wind rose charts:
from windrose import WindroseAxes
def assign_direction(row):
if row['winddir'] <= 11.25 or row['winddir'] >= 348.75:
return 'N'
elif row['winddir'] > 11.25 and row['winddir'] < 33.75:
return 'NNE'
elif row['winddir'] >= 33.75 and row['winddir'] <= 56.25:
return 'NE'
elif row['winddir'] > 56.25 and row['winddir'] < 78.75:
return 'ENE'
elif row['winddir'] >= 78.75 and row['winddir'] <= 101.25:
return 'E'
elif row['winddir'] > 101.25 and row['winddir'] < 123.75:
return 'ESE'
elif row['winddir'] >= 123.75 and row['winddir'] <= 146.25:
return 'SE'
elif row['winddir'] > 146.25 and row['winddir'] < 168.75:
return 'SSE'
elif row['winddir'] >= 168.75 and row['winddir'] <= 191.25:
return 'S'
elif row['winddir'] > 191.25 and row['winddir'] < 213.75:
return 'SSW'
elif row['winddir'] >= 213.75 and row['winddir'] <= 236.25:
return 'SW'
elif row['winddir'] > 236.25 and row['winddir'] < 258.75:
return 'WSW'
elif row['winddir'] >= 258.75 and row['winddir'] <= 281.25:
return 'W'
elif row['winddir'] > 281.25 and row['winddir'] < 303.75:
return 'WNW'
elif row['winddir'] >= 303.75 and row['winddir'] <= 326.25:
return 'NW'
else:
return 'NNW'
wd['direction'] = wd.apply(assign_direction, axis=1)
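The long if/elif chain can be compressed into index arithmetic: each sector is 22.5° wide, so shifting by half a sector and integer-dividing gives the sector index (a sketch; `compass_point` is my own name, and exact multiples of 11.25° fall to the higher sector, which can differ from the chain above at boundaries):

```python
COMPASS_16 = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
              'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']

def compass_point(deg: float) -> str:
    """Map a wind direction in degrees [0, 360) to a 16-point compass name."""
    return COMPASS_16[int(((deg + 11.25) % 360) // 22.5)]

print(compass_point(0))      # N
print(compass_point(225))    # SW
print(compass_point(124.7))  # SE, matching the first Tallinn row above
```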
ax = WindroseAxes.from_ax()
ax.bar(wd.winddir, wd.windspeed, normed=True, opening=0.8,
edgecolor='white', bins=np.arange(0, 50, 10))
plt.title('General Wind Rose on Cycling Days', fontsize = 20)
ax.set_legend()
This is a wind rose chart showing the 16 wind directions.
I use the argument 'normed=True' here, which means the scale on the graph shows percentages.
For example, the western wind direction is the most popular: it blew during 50 rides, or 11.1%.
I also set 'bins=np.arange(0, 50, 10)' to make it easier to compare wind speeds across all graphs. By default there are 6 bins, divided equally between the extreme values.
ax = WindroseAxes.from_ax()
ax.contourf(wd.winddir, wd.windspeed, cmap=cm.hsv, bins=np.arange(0, 50, 10))
plt.title('Filled Contour Wind Rose on Cycling Days', fontsize = 20)
ax.set_legend()
The contourf() function in Matplotlib's pyplot module plots filled contours. The scale on this graph represents counts.
The 'hsv' colormap reveals two rides with wind speeds below 10 km/h – look at the directions between south and west and between south and east.
ax = WindroseAxes.from_ax()
ax.contour(wd.winddir, wd.windspeed, cmap=cm.plasma, lw=3, bins=np.arange(0, 50, 10))
plt.title('Contour Wind Rose on Cycling Days', fontsize = 20)
ax.set_legend()
The contour wind rose looks good too!
And the rides from the smallest bin (0.0–10.0) are even easier to see.
ax = WindroseAxes.from_ax()
ax.contourf(wd.winddir, wd.windspeed, cmap=cm.hot, normed=True,
bins=np.arange(0, 50, 10))
ax.contour(wd.winddir, wd.windspeed, colors='black', normed=True,
bins=np.arange(0, 50, 10))
plt.title('Combined Contour Wind Rose on Cycling Days', fontsize = 20)
ax.set_legend()
These two types of wind rose graphs can be combined with each other.
Please note that the scale is the same on all graphs; only the displayed values change, because, for example, 40 rides is 8.9% of the 449 entries in the dataframe.
Those were all the kinds of wind roses, built from all the rides.
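The correspondence between raw counts and the `normed=True` percentages is simple arithmetic over the 449 entries (a sketch; `share` is my own helper name):

```python
TOTAL_RIDES = 449

def share(count: int) -> float:
    """Percentage of all rides, rounded to one decimal place."""
    return round(count / TOTAL_RIDES * 100, 1)

print(share(50))   # the western sector -> 11.1
print(share(40))   # -> 8.9
```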
But I can also make plots depending on the specific location!
Wind Roses by Location¶
There are only two such locations in my dataset, since to build a meaningful wind rose graph it is desirable to have data on at least 10 bike rides in one place.
There are 402 entries for Tallinn and 32 for Ayia Napa.
ltl = ['Tallinn']
tln = wd[wd['location'].isin(ltl)]
ax = WindroseAxes.from_ax()
ax.contourf(tln.winddir, tln.windspeed, cmap=cm.winter, bins=np.arange(0, 50, 10))
plt.title('Tallinn Wind Rose on Cycling Days', fontsize = 20)
ax.set_legend()
The shape of the wind rose itself has changed only slightly (understandable, since Tallinn accounts for 89.5% of rides).
The only noticeable difference is a decrease in the maximum number of rides in the western direction – 43 versus 50 on the overall graph.
lan = ['Ayia Napa']
anp = wd[wd['location'].isin(lan)]
ax = WindroseAxes.from_ax()
ax.contourf(anp.winddir, anp.windspeed, cmap=cm.autumn, bins=np.arange(0, 50, 10))
plt.title('Ayia Napa Wind Rose on Cycling Days', fontsize = 20)
ax.set_legend()
In Ayia Napa the wind rose is significantly different.
The only constant is the dominance of the western direction.
The northwest and north are also popular, while the directions from east to south are almost absent.
Comparison of Wind Roses¶
Now let’s return to the overall wind rose and superimpose the wind roses of Tallinn and Ayia Napa on its graph.
This overlay gives a good idea of how many rides are involved in each case.
ax = WindroseAxes.from_ax()
ax.contour(wd.winddir, wd.windspeed, cmap=cm.copper, bins=np.arange(0, 50, 10))
ax.contourf(tln.winddir, tln.windspeed, cmap=cm.winter, bins=np.arange(0, 50, 10))
ax.contourf(anp.winddir, anp.windspeed, cmap=cm.autumn, bins=np.arange(0, 50, 10))
plt.suptitle('Comparison of Three Wind Roses', fontsize = 20, y=1)
plt.title('Overall, Tallinn and Ayia Napa on Cycling Days', fontsize = 16)
ax.set_legend()
The colormaps of Tallinn and Ayia Napa remain unchanged, so you can now clearly see the true subset size of each location.
It may not be obvious at first glance, but the overall wind rose consists of Tallinn, Ayia Napa and 15 other rides combined.
For example, the western direction totals 50 rides – 43 from Tallinn and 7 from Ayia Napa. Overall, most of the empty contour is filled by Ayia Napa rides, and only 3.34% remains for other locations.
MTB St – Muddyfox Energy 26, steel, owned during wintering in Cyprus 2021/2022
Sea Level Air Pressure ¶
Hypothesis Refutation¶
Ayia Napa is a good self-contained data subset, so I will use it again to analyze the relationship between two variables: atmospheric pressure and average speed.
The hypothesis is that the average speed of a bicycle is higher when the air pressure is lower (i.e. there is a negative correlation).
This is an old debate in the cycling community. The statement itself has a scientific basis, but in real life it is difficult to prove.
The correlation matrix for the entire dataset showed almost no relationship (-0.07) between sea level pressure and the average speed of the bike.
Let's see if this is true for Ayia Napa alone:
an = df[df['location'] == 'Ayia Napa']
fig,ax = plt.subplots()
fig.set_size_inches(10, 6)
col1 = 'red'
col2 = 'blue'
ax.plot(an.date, an.avgspeed, color=col1)
ax.set_title('Average Bike Speed and Sea Level Air Pressure in Ayia Napa',
fontweight ='bold', fontsize = 16)
ax.set_xlabel('Cycling Days', fontsize = 14)
ax.set_ylabel('Average Bike Speed km/h', color = col1, fontsize = 14)
ax2 = ax.twinx()
ax2.plot(an.date, an.sealevelpressure, color = col2)
ax2.set_ylabel('Sea Level Air Pressure hPa', color = col2, fontsize = 14)
This graph has two 'y' scales and two different variables displayed as lines. I am interested in the dynamics.
The graph shows that the lines are practically independent of each other. So, no correlation could be identified and no dependence was found.
A quick calculation shows an extremely low correlation of 0.000808 between these variables in the Ayia Napa subset:
an['avgspeed'].corr(an['sealevelpressure'])
0.0008080254100268551
Sea Level Pressure Treemap¶
Now I will plot a treemap to show sea level pressure for each ride:
px.treemap(df,
path=[px.Constant('Rides'), 'location', 'bike', 'date'],
height=700,
color='sealevelpressure',
labels={'sealevelpressure': 'hPa', 'duration': 'Duration',
'distance': 'Distance km', 'avgspeed': 'Avg Speed km/h'},
color_continuous_scale='rainbow',
hover_name='sealevelpressure',
hover_data=['duration','distance','avgspeed'],
title='Sea Level Pressure Treemap of Rides Categorized by Location and Bike')
Let me remind you that standard atmospheric pressure (1 atm) is 1013.25 hPa. Anything higher is considered increased, anything lower decreased.
Air Pressure Levels¶
Let's find out what pressure prevails in my dataset:
ap = df[['date', 'sealevelpressure']].copy()  # copy, so adding a column later does not trigger SettingWithCopyWarning
def assign_pressurelevel(row):
if row['sealevelpressure'] < 998.0:
return 'Lowest'
elif row['sealevelpressure'] >= 998.0 and row['sealevelpressure'] < 1008.0:
return 'Low'
elif row['sealevelpressure'] >= 1008.0 and row['sealevelpressure'] <= 1018.0:
return 'Normal'
elif row['sealevelpressure'] > 1018.0 and row['sealevelpressure'] <= 1028.0:
return 'High'
else:
return 'Highest'
ap.loc[:, 'pressurelevel'] = ap.apply(assign_pressurelevel, axis=1)
px.pie(ap,
names='pressurelevel',
values=ap.value_counts().values,
color_discrete_sequence=['green', 'orange', 'deepskyblue', 'red', 'blue'],
height=500,
hole=.9,
labels={'pressurelevel': 'Sea Level Pressure', 'values': 'Rides'},
title='Sea Level Pressure on Cycling Days')
Well, on more than half of the days the atmospheric pressure was normal.
Among the rest, increased sea level pressure prevails.
Decreased air pressure altogether makes up 11.14%, or 50 days.
Something came to mind: "11:14" is a cool film in the strange genre of neo-noir black comedy thriller. This movie is worth watching :)
Lunar Phases ¶
The last variable that can be analyzed is the phases of the moon.
There are 4 principal (new moon, first quarter, full moon, last quarter) and 4 intermediate lunar phases (waxing crescent, waxing gibbous, waning gibbous, waning crescent).
Let's find out their distribution:
moon = df[['date', 'moonphase']].copy()  # copy, so adding a column later does not trigger SettingWithCopyWarning
def assign_lunarphase(row):
if row['moonphase'] == 0:
return 'New Moon'
elif row['moonphase'] >= 0.01 and row['moonphase'] <= 0.24:
return 'Waxing Crescent'
elif row['moonphase'] == 0.25:
return 'First Quarter'
elif row['moonphase'] >= 0.26 and row['moonphase'] <= 0.49:
return 'Waxing Gibbous'
elif row['moonphase'] == 0.5:
return 'Full Moon'
elif row['moonphase'] >= 0.51 and row['moonphase'] <= 0.74:
return 'Waning Gibbous'
elif row['moonphase'] == 0.75:
return 'Third Quarter'
else:
return 'Waning Crescent'
moon.loc[:, 'lunarphase'] = moon.apply(assign_lunarphase, axis=1)
px.pie(moon,
names='lunarphase',
values=moon.value_counts().values,
color_discrete_sequence=['maroon', 'cyan', 'darkblue', 'darksalmon',
'whitesmoke', 'red', 'black', 'blue'],
hole=0.7,
height=500,
labels={'lunarphase': 'Lunar Phase', 'values': 'Rides'},
title='Lunar Phases on Cycling Days')
The phases of the moon do not depend in any way on geographical location or time of year.
The correlation matrix also showed the absence of any significant relationship between this variable and anything.
The approximately equal distribution of rides across the lunar phases is exactly what one would expect.
If we compare the two large groups of phases, the waxing and the waning moon, we see that they are almost equal.
From waxing crescent to full moon versus from waning gibbous to new moon: the waxing moon accounts for 222 rides and the waning moon for 227.
Summary of the Analysis ¶
Let's summarize what I found interesting.
Some of what was discovered was obvious from the start, but some of it was surprising:
- The greater the distance covered, the longer it will take.
- Longer distances tend to have higher average speeds.
- The maximum distance covered in one ride has grown almost linearly since 2020 in increments of 50 km.
- The maximum speed in one ride does not grow steadily from year to year, but has two peaks in 2021 (28.15 km/h) and 2022 (28.42 km/h).
- All rides with an average speed of less than 10 km/h take place abroad.
- The Road Titanium bike is the busiest and fastest.
- This bike also holds the record for my longest ride of all time: 311.3 km and a time of 11:34:15.
- I found out empirically that in my case the weighted average value exceeds the arithmetic mean for all cycling variables.
- In year 2022, thanks to wintering in Cyprus, I reached the annual record of 6034.51 km.
- The year 2023 was notable for achieving a record of 5705.50 km in Estonia alone.
- All these years Hybrid Al is with me and in 2022 I owned four bicycles at the same time :D
- Winter is the least popular season of the year (1033.79 km) and most of these kilometers come from Cyprus.
- Summer is a great time for cycling, as evidenced by the 12743.06 kilometers covered. However, this is only 45% of the total.
- The most popular month for me is June with 4630.87 km, followed by almost equal August (4253.59 km) and September (4223.06 km).
- Road bikes (Road Al and Road Ti) have cycling season from May to early October, Hybrid Al used mainly in spring and autumn.
- There are only two weeks of the year (3rd and 50th) that I have never ridden a bike.
- Most popular week for cycling is 39 (end of September).
- Most popular day of the year is August 25th with 681.51 km covered.
- End of the week from Friday to Sunday is the busiest. Due to the work week, my activity peaks on Sunday and my worst is on Thursday.
- Nearly half of the rides take place in moderate temperatures. Comfort weather is 34.5%, which is also quite a lot.
- About 2/3 of my rides were under high or very high UV exposure.
- UV index remains consistently high from February to August, and becomes very high in May-July. The safest month for UV radiation is November.
- Daylight hours and UV index vary greatly depending on geographic location, but there is a strong positive correlation between them: the longer the day, the higher the UV index.
- Most often the duration of my rides is from 2 to 4 hours, and I prefer to cycle on those days when there is over 15 hours of daylight.
- I quite successfully managed to avoid heavy rain and only occasionally fell under moderate rain.
- The most humid month is November and the driest month is May. Cloudiness indicators follow humidity.
- 233 rides or 51.9% took place on days when the wind speed was less than 20 km/h, which made it possible not to worry about fighting the wind during training.
- The most popular wind directions are between the west and northwest.
- More than half of the days atmospheric pressure was normal. In the remaining part, increased sea level pressure prevails.
- Rides are distributed approximately equally across the lunar phases.
Road Al – Scott USA AFD Comp, aluminium, owned 2017-2022
Conclusion ¶
I hope you enjoyed my project as much as I did!
This analysis was done with an emphasis on Plotly library, as I like its interactivity. But I certainly didn’t forget the best types of graphs from Seaborn and Matplotlib.
This is my Magnum Opus Analyticae – and I'm really proud to present to you all my findings and visualizations!
Everything I have learned so far is reflected in this project.
Feel free to check out my other projects as well! One of them, by the way, is made in R.
I wish everyone who reads this notebook all the best and if you are my new potential employer, I hope I managed to impress you and you are already preparing a job offer :D
Stay cool and stay tuned!